Adrien BITTON defended his thesis, "Meaningful Audio Synthesis and Musical Interaction by Representation Learning of Sound Sample Databases", on the 14th of June 2021 at 11:00 AM,
with the jury:
(reviewer) Pr. Philippe PASQUIER, Simon Fraser University, Canada.
(reviewer) Pr. Charalampos SAITIS, Queen Mary University of London, United Kingdom.
Pr. Jean-Pierre BRIOT, Sorbonne Université, France.
Pr. Myriam DESAINTE-CATHERINE, Université de Bordeaux, France.
Pr. Dorien HERREMANS, Singapore University of Technology and Design, Singapore.
Pr. Bob L. T. STURM, Royal Institute of Technology KTH, Sweden.
(supervisors) Pr. Carlos AGON (Sorbonne Université) and Philippe ESLING.
Computer-assisted music extensively relies on audio sample libraries and virtual instruments, which provide users with an ever-increasing amount of content to produce music with. However, principled methods for large-scale interaction are lacking, so that browsing samples and presets with respect to a target sound idea remains a tedious and arbitrary process. Indeed, library metadata can only describe coarse categories of sounds; it does not meaningfully convey the underlying acoustic content and continuous variations in timbre, which are key elements of music production and creativity. Timbre perception has been studied by carrying out listening tests and organising human ratings into low-dimensional spaces that reflect the perceived similarity of sounds; however, these analysis spaces neither generalise to new and unrated examples nor allow audio to be synthesised. [...]
Recent advances in deep generative modelling have shown unprecedented success at learning large-scale unsupervised representations that can be inverted back to data as diverse as images, text and audio. These probabilistic models can be refined for specific generative tasks such as unpaired image translation and semantic manipulation of visual features, demonstrating the ability to learn transformations and representations that are perceptually meaningful. [...]
In this thesis, we target efficient analysis and synthesis with auto-encoders, learning low-dimensional acoustic representations for timbre manipulation and intuitive interaction in music production. We adapt domain translation techniques to timbre transfer and propose alternatives to adversarial learning for many-to-many transfers. In this process, timbre is implicitly modelled by disentangling the representations of domain-specific and domain-invariant features. We then develop models that explicitly capture timbre variations and enable controllable audio sampling, using conditioning for semantic attribute manipulation and hierarchical learning to represent both acoustic and temporal variations. We also apply discrete representation learning to decompose a target timbre into short-term acoustic features, which we use for audio conversions such as timbre transfer and voice-driven synthesis. By analysing and mapping this discrete latent representation, we show that synthesis can be directly controlled by acoustic descriptors. Finally, we investigate further reducing the complexity of trained models by weight trimming, enabling real-time inference under constrained computational resources. Because the objectives used for training the models are often disjoint from the ultimate generative application, our discussion and evaluation emphasise both learning performance and usability as a creative tool for music production. [...]
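As a rough illustration of the discrete representation learning mentioned in the abstract, the sketch below shows the quantisation step of a VQ-style model: encoder output frames are mapped to their nearest entries in a learned codebook of short-term acoustic features, and the resulting codebook vectors are what a decoder would consume. This is a minimal toy example, not the thesis code; all names, sizes and the random codebook are illustrative assumptions.

```python
import numpy as np

# Toy stand-ins (hypothetical): in a trained model the codebook is learned and
# the encodings come from an encoder network applied to audio frames.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # 8 discrete codes, each a 4-dim feature
encodings = rng.normal(size=(5, 4))  # 5 encoder output frames

# Nearest-neighbour assignment: each frame is replaced by the index of its
# closest codebook entry, yielding a discrete latent sequence.
dists = np.linalg.norm(encodings[:, None, :] - codebook[None, :, :], axis=-1)
indices = dists.argmin(axis=1)

# The decoder input is the corresponding codebook vectors, so any mapping or
# editing of `indices` (e.g. by acoustic descriptors) directly steers synthesis.
quantised = codebook[indices]

print(indices.shape)    # one discrete code per frame
print(quantised.shape)  # same shape as the encoder output
```

Editing the discrete sequence before decoding is what makes operations like timbre transfer or descriptor-driven control possible in this setting.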