From signal representation to representation learning: structured modeling of speech signals.
Nicolas Obin is pleased to invite you to the defense of his Habilitation à Diriger des Recherches (HDR), which will take place on Tuesday, September 12, 2023 at 2:00 pm, in the Salle Stravinsky at Ircam, and will also be streamed on YouTube at the following link: https://youtube.com/live/GLDJfD-OTrY
The presentation will be in French. On-site access within the limits of available places.
Jury:
Mr Thomas HUEBER, CNRS Research Director, GIPSA-lab, Rapporteur
Mr Emmanuel VINCENT, INRIA Research Director, MultiSpeech, Rapporteur
Mr Björn SCHULLER, Professor, Imperial College London, Rapporteur
Mr Gérard BIAU, Professor, Sorbonne University, Examiner
Mr Jean-François BONASTRE, INRIA Research Director, Defense and Security, Examiner
Ms Catherine PELACHAUD, CNRS Research Director, ISIR, Examiner
Mr Axel ROEBEL, Research Director, IRCAM, Examiner
Ms Isabel TRANCOSO, Professor, INESC - University of Lisbon, Examiner
Mr Nicolas BECKER, Sound Designer and Artist, Guest Member
This habilitation presents the last ten years of my research on the structured modelling of speech signals. Speech, as an oral language, constitutes the most elaborate communication system observed to date, characterised by a multidimensionality that is at once temporal, parametric and factorial. Its study mobilises numerous scientific fields, such as signal and information processing, machine learning, linguistics, psychology, sociology and even anthropology. In addition to its linguistic functions, speech reveals information about an individual, such as their biometric identity, physiology (gender/age, weight/height, health, etc.), psychology (emotional state, social attitude, personality, etc.), style (adaptation to the audience and communication channel), and culture (geographical origins, socio-professional status). The main problem in modelling speech signals is that these factors of variability are not directly observable but are intertwined in the speech signal in a complex and ambiguous manner. The challenge for automatic speech processing is therefore to identify and disentangle the factors of variability in speech signals, in particular through the statistical observation of regularities in databases.
My research is mainly focused on identifying and modelling the factors of variability related to the stylistics and expressivity of spoken communication. In particular, I have explored the use of machine learning to analyse, model and generate speech signals. The main challenge of my research is to resolve ambiguities in the speech signal by learning, from a limited amount of data, structured representations that encode the information associated with the various factors of variability under consideration (such as identity, style or expressivity). This research is articulated around three main axes: 1) cognition, i.e., mental representations of the human voice and their similarity; 2) perception, i.e., the human ability to separate and localise sound sources; and 3) generation, i.e., how to create or manipulate the identity or expressivity of real or artificial human voices. I will outline the transition from a signal paradigm to a learning paradigm: in speech synthesis, this shift has manifested itself as a three-stage evolution, from unit-selection speech synthesis, to statistical parametric modelling, to neural speech generation from compressed and incomplete representations. This paradigm shift can be explained by the limitations of traditional signal models for the analysis and synthesis of speech, particularly expressive speech, and by the historical duality of signal model and learning model, which separates signal models and their representations from the models that learn from them. The emergence of deep neural networks has made it possible to overcome this duality by learning representations as part of the training process itself.
The issue of data is paramount and conditions all learning problems. At one end of the spectrum, the abundance of data counterbalances the lack of human knowledge specified in learning models; at the other end, some models, such as physical models, are entirely specified by human knowledge and require no data for learning. Between these two poles lies an intermediate position that combines human knowledge specification with data-driven machine learning.
The main conclusions of my research support the idea of a necessary cooperation between the two poles of human knowledge and machine learning, in particular through the formulation of structured learning models grounded in human knowledge. While speech generation has largely solved the problems of intelligibility and naturalness, speech still resists both human knowledge and machines, and new challenges are opening up for research.
Future directions to be explored include the expressive and aesthetic functions of speech (and, by natural extension, of interpretation), speech-gesture multimodality in human behaviour, the modelling of verbal and non-verbal communication, situated and in context, and, more broadly, learning models that are economical in both hardware and algorithmic resources and respectful of personal data.
This habilitation will be accompanied by numerous sound illustrations from my research and its creative and artistic applications.