Hierarchical temporal learning for neural audio synthesis of music
Antoine Caillon, a doctoral student at Sorbonne Université, is defending his thesis "Hierarchical temporal learning for neural audio synthesis of music", conducted in the Musical Representations team of the IRCAM STMS laboratory under the supervision of Jean Bresson and Philippe Esling.
The defense will be held in English. It will take place at IRCAM and can also be followed live on the IRCAM YouTube channel: https://youtube.com/live/KS7REAEhyJQ
Simon Colton - Reviewer - Queen Mary University of London (United Kingdom)
Bob Sturm - Reviewer - KTH Royal Institute of Technology (Sweden)
Michèle Sebag - Examiner - Université Paris-Saclay
Patrick Gallinari - Examiner - Sorbonne Université
Mark Sandler - Examiner - Queen Mary University of London (United Kingdom)
Jean Bresson - Thesis Director - Sorbonne Université
Philippe Esling - Thesis Co-Director and Supervisor - Sorbonne Université
Recent advances in deep learning have offered new ways to build models addressing a wide variety of tasks through the optimization of a set of parameters based on minimizing a cost function. Amongst these techniques, probabilistic generative models have yielded impressive advances in text, image and sound generation. However, musical audio signal generation remains a challenging problem. In this thesis, we study how a hierarchical approach to audio modeling can address the musical signal modeling task, while offering different levels of control to the user. Our main hypothesis is that extracting different representation levels of an audio signal allows us to abstract away the complexity of the lower levels at each modeling stage. This would eventually allow the use of lightweight architectures, each modeling a single audio scale. We start by addressing raw audio modeling, proposing an audio model combining Variational Autoencoders and Generative Adversarial Networks that yields high-quality 48 kHz neural audio synthesis while running 20 times faster than real time on CPU. We then study how autoregressive models can be used to capture the temporal behavior of the representation produced by this low-level audio model, using optional additional conditioning signals such as acoustic descriptors or tempo. Finally, we propose a method for applying all the proposed models directly to audio streams, allowing their use in the real-time applications developed during this thesis.
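As a side note on the streaming contribution: one common way to make a convolutional model operate on audio streams is to cache the receptive-field context between chunks, so that chunk-by-chunk processing matches the offline result. The following is a minimal NumPy sketch of this cached-causal-convolution idea for a single filter; it is an illustrative assumption, not the actual architecture used in the thesis.

```python
import numpy as np

def causal_conv(x, kernel):
    """Offline causal convolution (cross-correlation), zero-padded on the left
    so that output sample n depends only on inputs up to n."""
    k = len(kernel)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.correlate(xp, kernel, mode="valid")

class StreamingConv:
    """Causal convolution over a stream: keeps the last k-1 input samples as a
    cache so that processing arbitrary chunks reproduces the offline output."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        # Cache starts as zeros, mirroring the offline left zero-padding.
        self.cache = np.zeros(len(self.kernel) - 1)

    def process(self, chunk):
        x = np.concatenate([self.cache, chunk])
        y = np.correlate(x, self.kernel, mode="valid")
        # Keep the last k-1 samples as context for the next chunk.
        self.cache = x[len(chunk):]
        return y
```

Splitting a signal into chunks and concatenating the outputs of `process` then gives exactly the same samples as `causal_conv` on the whole signal, which is the property that lets an offline-trained model run in a real-time, buffer-based audio loop.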