Guillaume DORAS defends his doctoral thesis in English on Thursday, May 28, 2020 at 2:30 pm. It is entitled:
Automatic Cover Detection Using Deep Learning
The public is invited to follow the defense at: video.ircam.fr
The jury is composed of:
Carlos Agon (Ircam/Sorbonne Université)
Rachel Bittner (Spotify)
Philippe Esling (Ircam/Sorbonne Université, invited member)
Slim Essid (Télécom Paris, reviewer)
Brian McFee (New York University)
Meinard Müller (International Audio Laboratories (AudioLabs) Erlangen, reviewer)
Geoffroy Peeters (Télécom Paris, thesis supervisor)
Joan Serrà (Dolby)
Covers are different interpretations of the same original musical work. They usually share a similar melodic line or harmonic structure, but typically differ greatly in one or several other dimensions, such as structure, tempo, key, instrumentation or genre. Automatic cover detection -- the task of finding and retrieving from an audio corpus all covers of one or several query tracks -- has long been seen as a challenging theoretical problem. It has also become an acute practical problem for music composers' societies, which face a continuous expansion of user-generated content that includes copyrighted musical excerpts.
Successful approaches in cover detection usually first extract an input representation that preserves the musical facets shared by different versions -- in particular the dominant melody or the harmonic structure -- and then compute a similarity score between pairs of representations. One of the challenges has always been to find a representation expressive enough to embed the musical information that characterizes the same work across different interpretations, yet discriminative enough to clearly separate tracks that belong to different musical works. With the ever-growing size of audio corpora, it has also become critical that this representation can be stored efficiently for fast lookup among thousands or millions of songs.
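The second step of this pipeline, scoring representation pairs and ranking a corpus against a query, can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is one possible score among many, and the function names, track identifiers and three-dimensional vectors are hypothetical placeholders rather than features or data from the thesis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_corpus(query, corpus):
    """Return track ids sorted from most to least similar to the query."""
    scores = {
        track_id: cosine_similarity(query, vector)
        for track_id, vector in corpus.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up vectors standing in for per-track representations.
corpus = {
    "cover_a": [0.9, 0.1, 0.0],
    "unrelated_b": [0.0, 0.2, 0.9],
    "cover_c": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_corpus(query, corpus)
print(ranking)
```

With fixed-size vectors like these, each comparison costs time proportional to the vector length only, which is what makes lookup over very large corpora tractable.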
In this work, we propose to address the cover detection problem with a solution based on the metric learning paradigm. We show in particular that this approach allows simple neural networks to be trained to extract from a song an expressive and compact representation -- its embedding -- allowing fast and effective retrieval in large audio corpora. We then propose a comparative study of different audio representations and show that systems combining melodic and harmonic features drastically outperform those relying on a single input representation. We illustrate how these features complement each other with both quantitative and qualitative analyses. We describe various fusion schemes and propose methods yielding state-of-the-art performance on large, publicly available datasets. We then describe theoretically how the embedding space is structured during training, and introduce an adaptation of the standard triplet loss which improves the results further. Finally, we describe an operational implementation of the method, and demonstrate its efficiency in terms of both accuracy and scalability in a real industrial context.
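For readers unfamiliar with metric learning, the standard triplet loss mentioned above can be sketched as below. This shows only the textbook formulation, max(0, d(anchor, positive) - d(anchor, negative) + margin), and not the adaptation introduced in the thesis, which this abstract does not specify. The Euclidean distance, the margin value and the two-dimensional toy embeddings are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss for one (anchor, positive, negative) triplet.

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (an assumed value here).
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings: anchor and positive stand for two versions of one work,
# the negatives stand for tracks of a different work.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]   # distance 5.0: already well separated
hard_negative = [0.0, 1.5]   # distance 1.5: inside the margin

loss_easy = triplet_loss(anchor, positive, easy_negative)
loss_hard = triplet_loss(anchor, positive, hard_negative)
print(loss_easy, loss_hard)
```

Only the hard negative contributes a non-zero loss. During training, gradients of such terms pull versions of the same work together and push other works away in the embedding space.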