Gabriel Meseguer Brocal will defend his thesis on the 9th of July at 10 AM.
The defense is open to the public via: video.ircam.fr
The title is: MULTIMODAL ANALYSIS: Informed Content Estimation and Audio Source Separation
The jury:
Dr. Laurent GIRIN, Grenoble-INP - Institut Polytechnique de Grenoble
Dr. Gael RICHARD, LTCI - Télécom Paris - Institut Polytechnique de Paris
Examiners:
Dr. Rachel BITTNER, Spotify, New York
Dr. Elena CABRIO, Université Côte d'Azur - Inria - CNRS - I3S
Dr. Bruno GAS, ISIR - UMR7222 - Sorbonne Université, Paris
Dr. Perfecto HERRERA BOYER, MTG - Universitat Pompeu Fabra, Barcelona
Dr. Antoine LIUTKUS, Centre Inria Nancy - Grand Est
Thesis director:
Dr. Geoffroy PEETERS, LTCI - Télécom Paris - Institut Polytechnique de Paris
Real-world stimuli are produced by complex phenomena and their constant interaction across various domains. Human understanding builds useful abstractions that fuse different modalities into a joint representation.
Multimodal learning describes methods that analyze phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods.
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information.
Among the many text sources related to music that could be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics.
The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where the linguistic dimension complements the abstraction of the musical instruments.
The first obstacle we address is the lack of data containing singing voice with aligned lyrics. This data is mandatory for developing our ideas. Therefore, we investigate how to create such a dataset automatically, leveraging resources from the World Wide Web. Creating this type of dataset is a challenge in itself that raises many research questions. We constantly face the classic "chicken or the egg" problem: acquiring and cleaning this data requires accurate models, but it is difficult to train models without data. We develop a method in which dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. We progressively improve the model using the collected data, and every time we obtain an improved version, we can in turn correct and enhance the data. Finally, we propose a method to automatically locate any remaining errors, allowing us to estimate the overall accuracy of the dataset, select the points that are correct, and eventually improve the erroneous data.
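The iterative loop described above, where model training and data cleaning feed each other, can be sketched roughly as follows. All function names and the confidence-threshold filtering are illustrative assumptions, not the thesis's actual pipeline:

```python
# Hypothetical sketch of the iterative dataset-creation / model-training loop:
# train on the current (noisy) data, re-align it with the improved model, and
# keep only the examples the model is confident about. The `train`, `align`,
# and `confidence` callables are placeholders for the real components.

def bootstrap_dataset(raw_pairs, train, align, confidence,
                      n_rounds=3, threshold=0.9):
    """raw_pairs: list of (audio, lyrics) pairs gathered from the web."""
    dataset = list(raw_pairs)  # start from noisy web data
    model = None
    for _ in range(n_rounds):
        model = train(dataset)  # learn from the current version of the data
        realigned = [align(model, audio, lyrics) for audio, lyrics in dataset]
        # Examples below the confidence threshold are flagged as likely
        # errors; here they are simply dropped, but they could be corrected.
        dataset = [ex for ex in realigned if confidence(model, ex) >= threshold]
    return model, dataset
```

The key design point is that neither step is run to completion first: each round of training makes the next round of cleaning more reliable, and vice versa.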
After developing the dataset, we center our efforts on exploring the interaction between lyrics and audio in two different tasks. First, we improve lyrics segmentation by combining text and audio, showing that each domain captures complementary structures that benefit the overall performance. Second, we explore vocal source separation, hypothesizing that knowing the aligned phoneme information is beneficial for performing this task.
We investigate how to integrate conditioning mechanisms into source separation in a multitask learning setting. Since the multitask scenario comes with a well-known dataset, it helps us validate the use of conditioning mechanisms. We then adapt these mechanisms to improve vocal source separation once the aligned phonemes are known.
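One common conditioning mechanism in this family is feature-wise linear modulation (FiLM), where a side input (here, it could be a phoneme embedding) predicts a per-channel scale and shift applied to the network's activations. The abstract does not name the mechanism used, so the sketch below is an illustration of the general idea, not the thesis's implementation:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each feature channel
    by condition-dependent parameters.

    features: (channels, time, freq) activations of the separation network
    gamma, beta: (channels,) vectors predicted from the conditioning input
    """
    # Broadcasting applies one (gamma, beta) pair to every (time, freq) cell
    # of the corresponding channel.
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy usage: a 2-channel feature map modulated by condition-derived parameters.
x = np.ones((2, 3, 4))
y = film(x, gamma=np.array([2.0, 0.5]), beta=np.array([1.0, -1.0]))
```

Because the conditioning only scales and shifts existing features, the same separation backbone can be reused across conditions, which is what makes this kind of mechanism attractive in a multitask setting.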
Finally, we summarize our contributions, highlighting the main research questions we address and our proposed answers.
We discuss in detail potential future work, addressing each task individually. We first propose new use cases for our dataset as well as ways of improving its reliability.
We also analyze the conditioning approach we developed and different strategies to improve it.