Gabriel Meseguer Brocal will defend his thesis on the 9th of July at 10 AM.
The defense is open to the public via: video.ircam.fr
The title is: MULTIMODAL ANALYSIS: Informed Content Estimation and Audio Source Separation
The jury:
Dr. Laurent GIRIN, Grenoble-INP - Institut Polytechnique de Grenoble
Dr. Gael RICHARD, LTCI - Télécom Paris - Institut Polytechnique de Paris
Examiners:
Dr. Rachel BITTNER, Spotify, New York
Dr. Elena CABRIO, Université Côte d'Azur - Inria - CNRS - I3S
Dr. Bruno GAS, ISIR - UMR7222 - Sorbonne Université, Paris
Dr. Perfecto HERRERA BOYER, MTG - Universitat Pompeu Fabra, Barcelona
Dr. Antoine LIUTKUS, Centre Inria Nancy - Grand Est
Thesis director:
Dr. Geoffroy PEETERS, LTCI - Télécom Paris - Institut Polytechnique de Paris
Real-world stimuli are produced by complex phenomena and their constant interaction across various domains. Human understanding builds useful abstractions that fuse different modalities into a joint representation.
Multimodal learning describes methods that analyze phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods.
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information.
Among the many text sources related to music that could be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics.
The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where the linguistic dimension complements the abstraction of the musical instruments.
The first obstacle we address is the lack of data containing singing voice with aligned lyrics. This data is mandatory for developing our ideas. Therefore, we investigate how to create such a dataset automatically, leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We constantly face the classic "chicken or the egg" problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We develop a method in which dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. We progressively improve the model using the collected data, and every time we obtain an improved version, we can in turn correct and enhance the data. Finally, we propose a method to automatically locate any remaining errors, allowing us to estimate the overall accuracy of the dataset, select the points that are correct, and eventually improve the erroneous data.
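The iterative loop described above, where model training and data cleaning feed each other, can be sketched roughly as follows. All function names and the confidence-threshold filtering are illustrative assumptions, not the thesis's actual pipeline:

```python
# Hypothetical sketch of the iterative dataset-creation / model-training loop:
# train on the current (noisy) data, re-align it with the improved model, and
# keep only the examples the model is confident about. The `train`, `align`,
# and `confidence` callables are placeholders for the real components.

def bootstrap_dataset(raw_pairs, train, align, confidence,
                      n_rounds=3, threshold=0.9):
    """raw_pairs: list of (audio, lyrics) pairs gathered from the web."""
    dataset = list(raw_pairs)  # start from noisy web data
    model = None
    for _ in range(n_rounds):
        model = train(dataset)  # learn from the current version of the data
        realigned = [align(model, audio, lyrics) for audio, lyrics in dataset]
        # Examples below the confidence threshold are flagged as likely
        # errors; here they are simply dropped, but they could be corrected.
        dataset = [ex for ex in realigned if confidence(model, ex) >= threshold]
    return model, dataset
```

The key design point is that neither step is run to completion first: each round of training makes the next round of cleaning more reliable, and vice versa.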
After developing the dataset, we center our efforts on exploring the interaction between lyrics and audio in two different tasks. First, we improve lyrics segmentation by combining text and audio, showing that each domain captures complementary structures that benefit the overall performance. Second, we explore vocal source separation, hypothesizing that knowing the aligned phoneme information is beneficial for performing this task.
We investigate how to integrate conditioning mechanisms into source separation in a multitask learning setting. Since the multitask scenario comes with a well-known dataset, it helps us validate the use of conditioning mechanisms. We then adapt these mechanisms to improve vocal source separation once the aligned phonemes are known.
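One common conditioning mechanism in this family is feature-wise linear modulation (FiLM), where a side input (here, it could be a phoneme embedding) predicts a per-channel scale and shift applied to the network's activations. The abstract does not name the mechanism used, so the sketch below is an illustration of the general idea, not the thesis's implementation:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel
    by condition-dependent parameters.

    features: (channels, time, freq) activations of the separation network
    gamma, beta: (channels,) vectors predicted from the conditioning input
    """
    # Broadcasting applies one (gamma, beta) pair to every (time, freq) cell
    # of the corresponding channel.
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy usage: a 2-channel feature map modulated by condition-derived parameters.
x = np.ones((2, 3, 4))
y = film(x, gamma=np.array([2.0, 0.5]), beta=np.array([1.0, -1.0]))
```

Because the conditioning only scales and shifts existing features, the same separation backbone can be reused across conditions, which is what makes this kind of mechanism attractive in a multitask setting.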
Finally, we summarize our contributions, highlighting the main research questions we address and our proposed answers.
We discuss in detail potential future work, addressing each task individually. We first propose new use cases for our dataset as well as ways of improving its reliability.
We also analyze the conditioning approach we developed and different strategies to improve it.