Multimodal Expressive Gesturing With Style
Mireille Fares, a PhD student at Sorbonne University supported by the Sorbonne Center for Artificial Intelligence (SCAI), will defend her thesis, entitled "Multimodal Expressive Gesturing With Style". The thesis was carried out within the PIROS team at ISIR and the Sound Analysis and Synthesis team of the STMS laboratory, under the supervision of Catherine Pelachaud (ISIR) and Nicolas Obin (STMS).
The thesis will be defended before a jury composed of:
Mr. Thierry ARTIÈRES, Professor, École Centrale Marseille, Reviewer
Ms. Chloé CLAVEL, Professor, Institut Polytechnique de Paris, Examiner
Mr. Michael NEFF, Professor, University of California, Reviewer
Mr. Nicolas OBIN, Associate Professor, Sorbonne Université, Examiner
Ms. Catherine PELACHAUD, Professor, Sorbonne Université, Examiner
Mr. Brian RAVENET, Associate Professor, Université Paris-Saclay, Examiner
Ms. Laure SOULIER, Associate Professor, Sorbonne Université, Examiner
Human communication is inherently multimodal: it encompasses a gestalt of signals that involve much more than the speech production system. Verbal and non-verbal communication modes are inextricably intertwined; together they deliver the semantic and pragmatic content of the message and tailor the communication process. These multimodal signals involve both vocal and visual channels which, when combined, render communication more expressive. The vocal mode is characterized by acoustic features, namely prosody, while the visual mode involves facial expressions, hand gestures and body gestures. The growth of virtual and online communication has created a need to generate expressive communication for human-like embodied agents, including Embodied Conversational Agents (ECAs) and social robots. One crucial communicative signal for ECAs, which can convey a wide range of messages, is the visual (facial and body) motion that accompanies speech and its semantic content. Generating appropriate and coherent gestures allows ECAs to articulate the intent and content of speech in a human-like, expressive fashion.
The central theme of the thesis is to leverage and control the behavioral expressivity of ECAs by modelling the complex multimodal behavior that humans employ during communication. Concretely, the driving forces of this thesis are twofold: (1) to exploit speech prosody, visual prosody and language with the aim of synthesizing expressive and human-like behaviors for ECAs; (2) to control the style of the synthesized gestures so that they can be generated in the style of any speaker. With these motivations in mind, we first propose a semantically-aware, speech-driven facial and head gesture synthesis model trained on a corpus that we collected from TEDx talks. We then propose ZS-MSTM 1.0, an approach that synthesizes stylized upper-body gestures driven by the content of a source speaker's speech (audio and text) and corresponding to the style of any target speaker, seen or unseen by our model. ZS-MSTM 1.0 is trained on the PATS corpus, which includes multimodal data from speakers with different behavioral styles. However, our model is not limited to PATS speakers: it can generate gestures in the style of any new speaker without further training or fine-tuning, which makes our approach zero-shot. More specifically, behavioral style is modelled from the speakers' multimodal data (language, body gestures, and speech) and is independent of the speaker's identity ("ID"). We additionally extend this model and propose ZS-MSTM 2.0, which generates stylized facial gestures in addition to upper-body gestures. We train ZS-MSTM 2.0 on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks aligned with the other multimodal features of this dataset (2D body poses, language, and speech).