Giovanni Bindi, a doctoral candidate at Sorbonne University within the École Doctorale Informatique, Telecom et Électronique (EDITE) in Paris, conducted his research entitled “Compositional Learning of Audio Representations” at the STMS laboratory (Ircam – Sorbonne Université – CNRS – Ministry of Culture), within the Analysis / Synthesis team, under the supervision of Philippe Esling.
The defense will take place in English, in the Stravinsky room at IRCAM on Monday, December 1st, 2025 at 2:00 pm. It will be recorded and posted on YouTube: https://youtube.com/live/bGBVgHi5s_0
The jury will be composed of:
• George Fazekas, Queen Mary University of London (Reviewer)
• Magdalena Fuentes, New York University (Reviewer)
• Ashley Burgoyne, University of Amsterdam (Examiner)
• Mark Sandler, Queen Mary University of London (Examiner)
• Geoffroy Peeters, Télécom Paris (Examiner)
• Philippe Esling, Sorbonne University (Supervisor)
Abstract:
This thesis explores the intersection of machine learning, deep generative models, and musical composition. While machine learning has transformed numerous fields, its application to music, and to the creative arts more broadly, raises specific challenges. We investigate the learning of compositional representations for musical audio, building on unsupervised decomposition of audio mixtures and probabilistic generative modeling. Guided by the principle of compositionality, according to which complex data can be described as combinations of simpler, reusable elements, we seek to understand how this principle manifests in musical audio signals.
Our framework is built on two complementary phases: decomposition and recomposition. In the decomposition phase, we introduce a simple, flexible, domain-agnostic model that learns to separate an input signal into several latent components without explicit supervision, which we apply notably to multi-instrument audio recordings. In the recomposition phase, we leverage these components within lightweight conditional generative models to produce new arrangements or complete missing parts of a musical accompaniment given some context. This thesis thus aims to contribute to bridging unsupervised decomposition and generative modeling for musical audio signals.