A neural voice transformation framework for modification of pitch and intensity
Frederik Bous has written his thesis on "A neural voice transformation framework for modification of pitch and intensity" as part of the Analysis Synthesis Team at the STMS Laboratory (Ircam - CNRS - Sorbonne University - Ministry of Culture). His research has been funded by an EDITE grant and by the ANR "ARS" project. In parallel, his work led him to collaborate with the artist Judith Deschamps (during her Artistic Research Residency at Ircam) to recreate Farinelli's voice.
He invites you to his thesis defence at Ircam on Thursday 21 September at 14:30. The presentation will be in English and will be streamed live on YouTube at the following link: https://youtube.com/live/rADj7VUEKt0
His jury will be composed of:
- Prof. Thierry Dutoit - University Professor - University of Mons (Belgium) - Rapporteur
- Prof. Yannis Stylianou - University Professor - University of Crete (Greece) - Rapporteur
- Dr. Christophe d'Alessandro - Research Director (HDR) - Institut Jean-Le-Rond-d'Alembert - Examiner
- Dr. Jordi Bonada - Research Fellow - Pompeu Fabra University (UPF) (Spain) - Examiner
- Dr. Nathalie Henrich - Research Director (HDR) - Grenoble Alpes University, UMR 5216 - Examiner
- Dr. Axel Roebel - Research Director (HDR) - Ircam, STMS Lab - Thesis Director
Abstract:
The human voice has been a source of fascination and an object of research for over 100 years, and numerous technologies for voice processing have been developed. In this thesis we are concerned with vocoders: methods that provide parametric representations of voice signals and that can be used for voice transformation. Previous studies have demonstrated important limitations of approaches based on explicit signal models: for realistic-sounding transformations, the dependencies between different voice properties have to be modelled precisely, but none of the models proposed so far is sufficiently refined to express these dependencies correctly.
Recently, deep neural networks have demonstrated impressive success in extracting parameter dependencies from data, and this thesis sets out to create a voice transformation framework using these networks. The framework works in two stages: first, a neural vocoder establishes an invertible mapping between raw voice signals and a mel-spectrogram representation; second, an auto-encoder establishes an invertible mapping between the mel spectrogram and the voice representation used for voice transformation. The auto-encoder's task is to create what we call the residual code, subject to two objectives. First, together with the control parameter, the residual code should allow the original mel spectrogram to be recreated. Second, the residual code should be independent of (disentangled from) the control parameter. Meeting both objectives makes it possible to create coherent voice signals from the potentially manipulated control parameter and the residual code.
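As a rough illustration of this second stage, the sketch below implements a bare-bones bottleneck auto-encoder in PyTorch. All names, layer choices, and dimensions (BottleneckAutoEncoder, code_dim, the per-frame log-F0 control, the MSE reconstruction loss) are illustrative assumptions, not the architecture actually used in the thesis.

```python
import torch
import torch.nn as nn

class BottleneckAutoEncoder(nn.Module):
    """Minimal sketch of a disentangling auto-encoder: the encoder compresses
    the mel spectrogram into a narrow residual code (the bottleneck), and the
    decoder reconstructs the mel spectrogram from that code together with an
    external control parameter (e.g. a per-frame log-F0 contour)."""

    def __init__(self, n_mels=80, code_dim=8, hidden=256):
        super().__init__()
        # Encoder: mel frames -> low-dimensional residual code.
        self.encoder = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, code_dim),
        )
        # Decoder: residual code + 1-dim control parameter -> mel frames.
        self.decoder = nn.Sequential(
            nn.Linear(code_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, mel, control):
        # mel:     (batch, frames, n_mels)
        # control: (batch, frames, 1), e.g. log-F0 per frame
        code = self.encoder(mel)                          # residual code
        recon = self.decoder(torch.cat([code, control], dim=-1))
        return recon, code

# Training pushes the reconstruction towards the input; because the code is
# too narrow to carry the pitch contour, the decoder is forced to take that
# information from `control`, leaving the code (approximately) disentangled.
model = BottleneckAutoEncoder()
mel = torch.randn(4, 100, 80)   # dummy mel-spectrogram batch
f0 = torch.randn(4, 100, 1)     # dummy per-frame control (log-F0)
recon, code = model(mel, f0)
loss = nn.functional.mse_loss(recon, mel)
```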
In the first part of the thesis, we discuss different approaches to neural vocoding and the advantages of the mel spectrogram over conventional parametric vocoder spaces. In the second part, we present the proposed auto-encoder, which uses an information bottleneck to achieve the disentanglement. We demonstrate experimental results for two control parameters: the fundamental frequency and the voice level. Transformation of the fundamental frequency is a frequently studied task, which allows us to compare our approach to existing techniques and to study how the auto-encoder models the dependency on other voice properties. For the voice level, annotations hardly exist. We therefore first provide a new technique for estimating the voice level in large voice databases, and subsequently use these annotations to train a bottleneck auto-encoder that allows changing the voice level.
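To make the transformation workflow concrete, here is a hedged usage sketch built on the hypothetical model above: the residual code is extracted once and kept fixed while the control parameter is replaced, here a pitch transposition in the log-F0 domain. The function name transpose_pitch and its interface are invented for illustration and do not reflect the thesis's actual implementation.

```python
import math
import torch

# Reuses the hypothetical BottleneckAutoEncoder from the sketch above.
def transpose_pitch(model, mel, log_f0, semitones=2.0):
    """Decode with a shifted F0 contour while keeping the residual code fixed."""
    with torch.no_grad():
        code = model.encoder(mel)                 # extract the residual code
        # A shift of k semitones multiplies F0 by 2**(k/12),
        # i.e. adds k * ln(2) / 12 in the log domain.
        shifted = log_f0 + semitones * math.log(2.0) / 12.0
        new_mel = model.decoder(torch.cat([code, shifted], dim=-1))
    return new_mel  # a neural vocoder would turn this back into a waveform
```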