Neural Conversion of Social Attitudes in Speech Signals
The thesis "Neural Conversion of Social Attitudes in Speech Signals" by Clément Le Moine-Veillon is co-funded by the Île-de-France region and the Stellantis automotive group within the MoVE project ("Modeling of Voice Expressivity"). It is directed by Axel Roebel and supervised by Nicolas Obin in the Sound Analysis-Synthesis team of the STMS laboratory (Ircam-CNRS-Sorbonne University-Ministry of Culture).
Clément Le Moine-Veillon will defend his thesis on February 27, 2023, at 3pm at Ircam. The defense can also be followed live on the Ircam YouTube channel: https://youtube.com/live/6ocHIjbDQuE
The jury is composed of:
Thomas Hueber, CNRS Research Scientist, GIPSA-lab, Grenoble (reviewer)
Damien Lolive, Professor, IRISA, University of Rennes 1 (reviewer)
Berrak Sisman, Associate Professor, University of Texas
Catherine Pelachaud, CNRS Research Director, ISIR, Sorbonne University
Carlos Busso, Professor, University of Texas
Jaime Lorenzo-Trueba, Researcher, Amazon
When communicating vocally, humans transmit a set of social signals that considerably enrich the meaning conveyed. The speaker's social attitudes - at the heart of this process - are the focus of this research, which aims to develop neural algorithms for their conversion. Our main contributions are: the creation of a French database of social attitudes in speech; the identification of production strategies and of biases in the perception of social attitudes; the development of a BWS-Net, an algorithm mimicking human perception of social attitudes; a first conversion algorithm based on multiscale modeling of F0 contours; and a second conversion algorithm based on the Transformer, trained on mel-spectrogram representations of the speech signal and linguistically conditioned by a speech recognition module. These contributions are detailed in the abstract below.
The initial step of this work was the creation of Att-HACK, a multi-speaker French database consisting of about 30 hours of expressive speech dedicated to four social attitudes: friendliness, distance, dominance and seduction. This database provided the material to understand how these attitudes are communicated vocally. First, an acoustic analysis of the collected data, grounded in the anatomical mechanisms of speech production, allowed us to identify strategies common to French speakers and to reveal prototypical profiles of attitude production. Second, a Best-Worst Scaling (BWS) experiment conducted with about a hundred subjects allowed us to evaluate the perception of the attitudes produced in Att-HACK, highlighting significant interactions with the linguistic content and with the gender of the speaker.
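In a Best-Worst Scaling experiment, each trial presents a listener with a small tuple of stimuli, and the listener picks the one that best and the one that least conveys the target attitude. A standard way to turn such judgments into per-stimulus scores is the best-minus-worst count normalized by the number of appearances; the sketch below illustrates that scoring scheme (the trial format and stimulus names are illustrative, not the thesis protocol):

```python
from collections import defaultdict

def bws_scores(trials):
    """Compute best-minus-worst scores from BWS trials.

    Each trial is (items, best, worst): the tuple of stimuli shown,
    the one judged 'best' and the one judged 'worst' for the target
    attitude. Returns a score in [-1, 1] per stimulus.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, b, w in trials:
        for it in items:
            shown[it] += 1
        best[b] += 1
        worst[w] += 1
    return {it: (best[it] - worst[it]) / shown[it] for it in shown}

# Toy judgments from three hypothetical trials over four utterances.
trials = [
    (("u1", "u2", "u3", "u4"), "u1", "u4"),
    (("u1", "u2", "u3", "u4"), "u1", "u3"),
    (("u1", "u2", "u3", "u4"), "u2", "u4"),
]
scores = bws_scores(trials)  # u1 ranks highest, u4 lowest
```

One appeal of BWS over rating scales is that forced choices sidestep per-listener calibration differences, which matters when aggregating judgments from a hundred subjects.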
Having shown the existence of humanly perceptible invariants within our data, we worked on algorithms capable of capturing these invariants through the objective - explicit or implicit - of attitude recognition. In particular, we developed a BWS-Net, an algorithm for the perceptual evaluation of the communicated attitude, trained on the judgments of the participants in the BWS experiment. This algorithm allowed us to extend the validation of Att-HACK to untested data, in particular to identify the recordings in which the attitude is poorly communicated, and thus to provide clean data for conversion learning.
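A natural way to train a network on best/worst judgments is a ranking objective: the predicted attitude score of each trial's "best" stimulus should exceed that of its "worst" stimulus by some margin. The following framework-free sketch shows that loss in isolation; the scalar scores stand in for a real network's outputs, and this is an assumed formulation, not necessarily the one used in the thesis:

```python
def margin_ranking_loss(score_best, score_worst, margin=1.0):
    """Hinge loss that is zero once the 'best' stimulus outscores
    the 'worst' one by at least `margin`, and grows linearly with
    any violation of that ordering."""
    return max(0.0, margin - (score_best - score_worst))

# Illustrative scores a network might assign to a best/worst pair:
ok = margin_ranking_loss(2.0, 0.5)       # ordering satisfied with margin
violated = margin_ranking_loss(0.5, 2.0) # ordering reversed, large loss
```

Once trained, such a scorer can be run over the whole corpus, and utterances falling below a score threshold for their intended attitude can be flagged as poorly communicated and excluded from conversion training.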
Intonation - represented by variations in the fundamental frequency, or F0 - was found to be central to the communication of the social attitudes investigated in the two studies mentioned above. We therefore initially sought to convert this single parameter by modeling its variations at different temporal scales - from micro- to macro-prosody - using a neural layer that learns Continuous Wavelet Transform (CWT) representations. We proposed an end-to-end algorithm in which the decomposition of the F0 signal and the conversion - via a Dual-GAN - of the resulting representations are learned jointly for pairs of attitudes. Objective measurements and a subjective listening test validated the performance of this model for two different speakers. These first results highlighted the difficulties inherent in using a parametric representation of the speech signal (intrinsic coherence of the converted signal, naturalness of the conversion) and led us to opt for a complete, compact and perceptually relevant representation of the speech signal for conversion learning: the mel-spectrogram.
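The idea of separating an F0 contour into temporal scales can be illustrated without the learned CWT layer: a fixed cascade of moving-average smoothers splits the contour into a coarse (macro-prosodic) trend, intermediate bands, and a fine (micro-prosodic) residual, with the components summing back to the original contour. This is only a non-learned stand-in for the thesis's trainable wavelet decomposition; window sizes and the synthetic contour are illustrative:

```python
import math

def smooth(signal, win):
    """Moving-average smoothing with edges clamped to the signal."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - win), min(n, i + win + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def multiscale_f0(f0, windows=(2, 8, 32)):
    """Split an F0 contour into components at increasing temporal
    scales (macro down to micro prosody) plus a residual; by
    construction the components sum back to the original contour."""
    components, residual = [], list(f0)
    for win in sorted(windows, reverse=True):  # coarsest scale first
        coarse = smooth(residual, win)
        components.append(coarse)
        residual = [r - c for r, c in zip(residual, coarse)]
    components.append(residual)  # finest remaining detail
    return components

# Synthetic contour: slow phrase-level drift plus fast jitter (Hz).
f0 = [120 + 15 * math.sin(i / 16) + 2 * math.sin(i) for i in range(100)]
bands = multiscale_f0(f0)  # 3 smoothed bands + 1 residual
```

Making such a decomposition differentiable, as in the learned-CWT layer, lets the scale analysis itself adapt to the conversion objective instead of being fixed in advance.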
Building on the lessons learned from this first proposal, we developed a more ambitious algorithm based on the Transformer architecture, linguistically conditioned by a speech recognition module and allowing the simultaneous learning of conversions between the four Att-HACK attitudes. Objective measurements and a subjective listening test validated the performance of this model in single-speaker conversion. Experiments in multi-speaker conversion, as well as with attitudinal-intensity control based on the incorporation of a BWS-Net, show promising first results.
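A common way to let a single sequence model learn all source-to-target attitude conversions at once is to condition it on an embedding of the target attitude, for example appended to every input frame. The sketch below shows only that conditioning step; the one-hot "embeddings", the two-bin mel frames, and the function name are illustrative assumptions, not the thesis architecture:

```python
ATTITUDES = ("friendly", "distant", "dominant", "seductive")

# Illustrative one-hot lookup standing in for learned attitude embeddings.
embeddings = {a: [float(i == k) for k in range(len(ATTITUDES))]
              for i, a in enumerate(ATTITUDES)}

def condition_frames(mel_frames, target_attitude):
    """Append the target-attitude embedding to every mel frame, so a
    single model can be trained on all pairwise conversions at once."""
    emb = embeddings[target_attitude]
    return [frame + emb for frame in mel_frames]

frames = [[0.1, 0.2], [0.3, 0.4]]          # toy 2-bin mel frames
out = condition_frames(frames, "dominant")  # each frame gains 4 dims
```

Replacing the fixed embedding with a continuous score, such as one produced by a BWS-Net, is one route to the attitudinal-intensity control mentioned above.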