On Temporal Constraints for Deep Neural Voice Alignment
Yann Teytaut completed his thesis "On Temporal Constraints for Deep Neural Voice Alignment" as part of the Sound Analysis and Synthesis team at the STMS Laboratory (Ircam-Sorbonne University-CNRS-Ministry of Culture) and the École doctorale Informatique, télécommunications et électronique de Paris. His research work was funded by the ANR ARS project (http://ars.ircam.fr/).
He invites you to attend his thesis defense at Ircam on Friday, July 7, at 9:30 am, or to follow it live on Ircam's YouTube channel: https://youtube.com/live/O5RWUl_vZ9M. The presentation will be in English.
The jury members will be:
- Pr. Gaël Richard, Professor, Télécom Paris — Reviewer
- Dr. Emmanouil Benetos, Reader, Queen Mary University of London (QMUL) — Reviewer
- Pr. Jean-Pierre Briot, Research Director, LIP6 (CNRS/SU) — Examiner
- Dr. Emmanuel Vincent, Research Director, Inria Nancy-Grand Est — Examiner
- Dr. Rachel Bittner, Research Manager, Spotify Inc. — Examiner
- Dr. Romain Hennequin, Head of Research, Deezer — Examiner
- Dr. Chitralekha Gupta, Research Fellow, National University of Singapore (NUS) — Examiner
- Dr. Axel Roebel, Research Director, Ircam — PhD Supervisor
To listen, to respond, to make coincide, to coordinate, to adjust, to follow, to adapt, to be in unison, to synchronize, to align... The rich vocabulary dedicated to the correspondence of human activities shows the importance of their temporal organization. Human communication, multi-modal by nature, is fully concerned by this problem, since a semantic gap exists between oral utterances and their symbolic sequences: how should a written message be interpreted without vocal intonation? What performative style lies beyond a fixed musical score?

This thesis proposes to uncover the complex underlying relationships between the audio and symbolic domains, and thereby reduce this gap, through a fine-grained study of the temporality inherent in voice recordings. The voice alignment task lies at the core of this objective, as it aims to determine the temporal occurrence of symbols that are assumed to be present in a voice signal. This work notably focuses on the development of an acoustic model, ADAGIO, capable of estimating such time-symbol links. Recent progress in deep learning has made it possible to implement ADAGIO as a deep neural network within a powerful generic formalism: Connectionist Temporal Classification (CTC).

However, the great flexibility offered by CTC is undermined by its intrinsic lack of guarantees for temporally accurate predictions. The key contributions of this research therefore consist in reinforcing CTC with additional temporal constraints to improve the quality of the inferred alignments. To this end, three ancillary tasks are introduced, with a positive impact on the alignment between voices, texts, and notes: (1) spectral content reconstruction; (2) audio structure propagation; and (3) guided monotonicity. Finally, through collaborations, ADAGIO contributes to several practical applications, such as concatenative speech synthesis and the study of expressive production strategies at play both in social attitudes in speech and in singing style in musical performances.
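The abstract describes CTC only at a high level. For readers unfamiliar with how CTC yields time-symbol links, the sketch below illustrates the standard mechanism: forced alignment by Viterbi decoding over the blank-augmented CTC state topology. This is a generic, minimal illustration (all names are hypothetical), not the ADAGIO model or the thesis's constrained variants.

```python
import numpy as np

def ctc_forced_align(log_probs, target, blank=0):
    """Viterbi forced alignment under the standard CTC topology.

    log_probs: (T, V) array of per-frame log posteriors (blank included).
    target: list of symbol ids (no blanks), assumed present in the audio.
    Returns one symbol id (possibly blank) per frame along the best path.
    """
    T = log_probs.shape[0]
    # Extended label sequence with blanks interleaved: [b, s1, b, s2, b, ...]
    ext = [blank]
    for s in target:
        ext.extend([s, blank])
    S = len(ext)
    dp = np.full((T, S), -np.inf)          # best log-score ending in state j at t
    back = np.zeros((T, S), dtype=int)     # backpointers for path recovery
    dp[0, 0] = log_probs[0, ext[0]]        # start at leading blank...
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]    # ...or at the first symbol
        back[0, 1] = 1
    for t in range(1, T):
        for j in range(S):
            # CTC allows staying, advancing by one, or skipping a blank
            # between two distinct symbols.
            cands, idxs = [dp[t - 1, j]], [j]
            if j >= 1:
                cands.append(dp[t - 1, j - 1]); idxs.append(j - 1)
            if j >= 2 and ext[j] != blank and ext[j] != ext[j - 2]:
                cands.append(dp[t - 1, j - 2]); idxs.append(j - 2)
            k = int(np.argmax(cands))
            dp[t, j] = log_probs[t, ext[j]] + cands[k]
            back[t, j] = idxs[k]
    # The path may end on the last symbol or on the trailing blank.
    j = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[j]
        j = back[t, j]
    return path
```

Given sharply peaked frame posteriors, the recovered path directly gives symbol onsets and durations; the thesis's contribution, as summarized above, is to constrain the network so that such alignments are temporally accurate even when the posteriors are not.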