EVA


Describing a voice in a few words is an abstract task. We can talk about a “deep,” “breathy,” or “hoarse” voice, but characterizing a voice rigorously would require a limited set of precisely defined attributes forming an ontology. No such descriptive framework currently exists.

Machine learning applied to speech suffers from the same weakness: in most automatic processing tasks, the speaker is modeled by abstract, general representations whose characteristics are not explicit, or only minimally so. For example, automatic speaker identification is generally approached using the x-vector paradigm, which describes a speaker's voice with an embedding specially designed for this task. Despite their high accuracy, x-vectors are generally unsuitable for detecting similarities between different voices that share common characteristics.
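To make the x-vector paradigm concrete, here is a minimal sketch of how such embeddings are typically compared: a cosine score between a pair of fixed-dimensional vectors, thresholded for an accept/reject decision. The toy 4-dimensional vectors are illustrative assumptions; real x-vectors are extracted by a neural network and have hundreds of dimensions.

```python
import numpy as np

def cosine_score(x1, x2):
    """Cosine similarity between two speaker embeddings (x-vectors)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return float(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))

# Toy 4-dim "x-vectors" (real systems use hundreds of dimensions).
enroll = [0.9, 0.1, 0.3, 0.2]   # enrolled speaker
same   = [0.8, 0.2, 0.3, 0.1]   # another utterance, same speaker
other  = [-0.2, 0.9, 0.1, 0.7]  # a different speaker

# Same-speaker pairs score close to 1; different speakers score lower.
# A threshold on this score drives the identification decision.
```

The score says only "same speaker or not": nothing in the embedding tells us *which* characteristics (pitch, accent, voice quality) make two voices similar, which is precisely the limitation this project addresses.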

The same observations apply to speech generation: speech synthesis is generally controlled by injecting the speaker's style or identity via unstructured representations. These representations make it possible to bypass the task of defining and learning ontologies, but they only allow a subset of the characteristics of a voice (gender, fundamental frequency, rhythm, intensity) to be imitated without explicitly stating the attributes. They also remain limited by their inability to generate new, original voices.

The goal of this project is to decipher the codes of human voices by learning explicit and structured representations of voice attributes. Achieving this goal will have a strong scientific and technological impact in at least two areas of application: first, in speech analysis, it will provide an understanding of the complex intertwining of human voice characteristics; second, in voice generation, it will fuel a wide range of applications for creating a voice with the desired attributes, enabling the design of what is known as a vocal personality.

The set of attributes will be defined by human expertise or discovered from data using lightly supervised, self-supervised, or unsupervised neural networks. It will include a detailed and explicit description of timbre, voice quality, phonation, speaker idiosyncrasies such as specific pronunciations or speech disorders (e.g., lisping), regional or non-native accents, and paralinguistic elements such as emotion or style. Ideally, each attribute could be controlled in synthesis and conversion by a degree of intensity, allowing it to be amplified or removed from the voice as part of a structured representation. New attributes could also be discovered automatically, for example through voice disentanglement or self-supervised representations that identify salient attributes in multi-speaker datasets.
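The idea of controlling an attribute by a degree of intensity can be sketched as follows, under the assumption (hypothetical here) that an attribute corresponds to a direction in the embedding space: scaling the embedding's projection onto that direction amplifies, attenuates, or removes the attribute. The "breathiness" axis below is an illustrative stand-in, not a learned attribute.

```python
import numpy as np

def apply_attribute(embedding, attribute_direction, intensity):
    """Amplify (intensity > 1), attenuate (< 1), or remove (= 0) one
    attribute by rescaling the embedding's component along its axis."""
    e = np.asarray(embedding, dtype=float)
    d = np.asarray(attribute_direction, dtype=float)
    d = d / np.linalg.norm(d)
    current = float(e @ d)  # how much of the attribute is present
    return e + (intensity - 1.0) * current * d

# Toy 3-dim voice embedding and a hypothetical "breathiness" axis.
voice = np.array([1.0, 0.5, 0.2])
breathy_axis = np.array([0.0, 1.0, 0.0])

removed = apply_attribute(voice, breathy_axis, 0.0)  # attribute removed
boosted = apply_attribute(voice, breathy_axis, 2.0)  # attribute doubled
```

With a structured representation, each such axis would carry an explicit, human-interpretable label, which is what distinguishes this project's goal from manipulating an opaque embedding.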

The main expected industrial results concern two use cases for voice transformation. The first is voice anonymization: to enable GDPR-compliant voice recordings, voice conversion systems could be configured to remove attributes strongly associated with a speaker's identity, while leaving other attributes unchanged to preserve the intelligibility, naturalness, and expressiveness of the manipulated voice. The second is voice creation: new voices could be sculpted from a set of desired attributes to fuel the creative industries.
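Given a structured attribute representation, the anonymization use case reduces to a selective substitution. The sketch below assumes a named attribute vector and a hand-picked set of identity-linked attributes; both the attribute names and the identity/expressive split are illustrative, not a real ontology.

```python
# Hypothetical structured attribute vector: each named slot is one
# explicit voice attribute (names are illustrative only).
ATTRIBUTES = ["timbre", "f0_range", "accent", "speech_rate", "emotion"]
IDENTITY_LINKED = {"timbre", "f0_range", "accent"}  # assumed identity cues

def anonymize(attr_vector, neutral_vector):
    """Replace identity-linked attributes with neutral values while
    leaving expressive attributes (rate, emotion) untouched."""
    out = dict(attr_vector)
    for name in IDENTITY_LINKED:
        out[name] = neutral_vector[name]
    return out

speaker = {"timbre": 0.8, "f0_range": 0.3, "accent": 0.9,
           "speech_rate": 0.6, "emotion": 0.7}
neutral = {name: 0.5 for name in ATTRIBUTES}

anon = anonymize(speaker, neutral)
# Identity cues are neutralized; expressiveness is preserved.
```

The same mechanism, with a different attribute selection, covers the voice-creation use case: instead of neutral values, the target slots receive the desired values for the new voice.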
