Abstracts of contributions
Yves Laprie : Prediction of the geometric shape of the vocal tract from the sequence of phonemes to be articulated
The presentation will focus on the prediction of the geometric shape of the vocal tract from a sequence of phonemes. It will begin by presenting the different approaches used in the past, particularly those based on articulatory models, in order to provide an overview of the issues and difficulties. The presentation will then focus on the use of dynamic MRI to capture articulatory gestures. As cine-MRI data cannot be exploited directly, we will present automatic articulator-tracking tools along with their limitations. Finally, we will present a deep learning approach for predicting the geometric shape of the vocal tract in the midsagittal plane from the phoneme sequence to be articulated.
Yves Laprie is a CNRS researcher at LORIA in Nancy. His research focuses on articulatory synthesis and modeling, speech analysis and language learning. In recent years, he has worked mainly on the exploitation of real-time MRI data.
Thomas Hueber : Acoustic-articulatory modeling: from assistive technologies to the study of speech development mechanisms
Speech production is a complex motor process involving several physiological phenomena, such as the neural, nervous and muscular activities that drive our respiratory, laryngeal and articulatory movements. Modeling speech production, in particular the relationship between articulatory gestures (tongue, lips, jaw, velum) and acoustic realizations of speech, is a challenging, and still evolving, research question. From an application standpoint, such models could be embedded in assistive devices able to restore oral communication when part of the speech production chain is damaged (articulatory synthesis). They could also help rehabilitate speech sound disorders through biofeedback-based therapy (articulatory inversion). From a more fundamental research perspective, such models can also be used to question the cognitive mechanisms underlying speech perception and motor control. In this talk, I will present different studies conducted in our group, aiming to learn acoustic-articulatory models from real-world data using (deep, but not only deep) machine learning. First, I will focus on different attempts to adapt a direct or inverse model, pre-trained on a reference speaker, to any new speaker. Then, I will present recent work on the integration of articulatory priors into the latent space of a variational auto-encoder, with potential application to speech enhancement. Finally, I will describe a recent line of research aiming to study, through modeling and simulation, how a child learns the acoustic-to-articulatory inverse mapping in a self-supervised manner when repeating auditory-only speech stimuli.
Thomas Hueber is a CNRS research director at GIPSA-Lab in Grenoble, France, and head of the CRISSP (Cognitive Robotics, Interactive Systems, Speech Processing) research team. His work focuses on automatic speech processing, with a particular interest in multimodal approaches (audio-visual) and human biological signals related to speech production (e.g. articulatory, muscular and brain signals).
Nathalie Henrich : From source-filter theory to pneumo-phono-resonant interactions: the complexity of the human voice
For more than half a century, source-filter theory has remained at the heart of the modeling, analysis and synthesis of the human voice and its expressions, such as speech and singing. This theory, and the understanding of human voice production it provides, will be the subject of this presentation. Nathalie Henrich will then show how the diversity of phonatory and articulatory gestures requires a rethinking of this model by including levels of interaction that she will detail.
Nathalie Henrich is a scientist with a passion for the human voice in all its forms of expression. She is Director of Research at the CNRS in the Institute of Human and Social Sciences (INSHS), Language Sciences Section. Her research projects focus on experimental and clinical phonetics of speech and singing, on the physiological and physical characterization of vocal techniques (lyrical singing, amplified singing, world singing), and on the development of non-invasive experimental techniques and mechatronic vocal avatars. She coordinated the World Voice Day in France (April 2022). In 2013, the CNRS awarded her the bronze medal for her work in vocology.
Axel Roebel : Deep learning methods for voice processing: Neural Vocoding for Voice Transformation
In recent years, speech synthesis and processing have been dominated by data-driven methods and deep neural networks. The use of ever-increasing amounts of data allows models with ever more parameters, leading to continuous improvements. Unfortunately, the growing computational complexity hinders the widespread application of these models. The first part of the talk will focus on research on data- and computationally efficient voice transformation using a deep neural network that integrates a WaveNet into a classical source-filter model. The discussion will motivate the structure of the model and the training losses. The deficiencies of the proposed model will lead to a brief reflection on future prospects, given the rapid evolution of neural vocoding. The second part will discuss ongoing research on applications of the neural vocoder, combining it with models dedicated to intensity, pitch, expressiveness or identity transformation.
Axel Roebel is a research director at IRCAM and head of the Analysis/Synthesis team. His research activities focus on voice and music synthesis and transformation, with a strong emphasis on artistic and industrial applications. After many years of research on various signal processing algorithms, he turned to data-driven methods.