As part of the HAIKUS project (ANR-19-CE23-0023), funded by the French national research agency (ANR), IRCAM, LORIA and IJLRA are organising a one-day workshop focusing on methodological advances in Audio Augmented Reality and its applications.
Audio Augmented Reality (AAR) seeks to integrate computer-generated and/or pre-recorded auditory content into the listener's real-world environment. Hearing plays a vital role in how we understand and interact with our spatial environment, and convincing AAR significantly enhances the auditory experience and increases user engagement in Augmented Reality (AR) applications, particularly in artistic creation, cultural mediation, entertainment and the communication industries.
Audio-signal processors are a key component of the AAR workflow, as they provide real-time control of 3D sound spatialisation and of the artificial reverberation applied to virtual sound events. These tools have now reached maturity and can support large multichannel loudspeaker systems as well as binaural rendering on headphones. However, the accuracy of the spatial processing applied to virtual sound objects is essential to ensure their seamless integration into the listener's real environment and thereby guarantee a high-quality user experience. To achieve this level of integration, methods are needed to identify the acoustic properties of the environment and adjust the spatialisation engine's parameters accordingly. Ideally, such methods should enable automatic inference of the acoustic channel's characteristics based solely on live recordings of the natural, and often dynamic, sounds present in the real environment (e.g. voices, noise, ambient sounds, moving sources). These topics are gaining increasing attention, especially in light of recent advances in data-driven approaches within the field of acoustics. In parallel, perceptual studies are being conducted to define the level of accuracy required to guarantee a coherent sound experience.
Organising committee:
Antoine Deleforge (INRIA), François Ollivier (MPIA-IJLRA), Olivier Warusfel (IRCAM)
Provisional programme (schedule and order of speakers subject to change)
Panel list and talks:
Toon van Waterschoot (KU Leuven - B)
Cagdas Tuna (Fraunhofer IIS - D)
Summary: Knowledge of the geometric properties of a room can be very beneficial for many audio applications, including sound source localization, sound reproduction, and augmented and virtual reality. Room geometry inference (RGI) deals with the problem of localizing acoustic reflectors based on room impulse responses (RIRs) recorded between loudspeakers and microphones.
Rooms with highly absorptive walls or walls at large distances from the measurement setup pose challenges for RGI methods. In the first part of the talk, we present a data-driven method to jointly detect and localize acoustic reflectors that correspond to nearby and/or reflective walls. We employ a multi-branch convolutional recurrent neural network whose input consists of a time-domain acoustic beamforming map, obtained via Radon transform from multi-channel room impulse responses. We propose a modified loss function forcing the network to pay more attention to walls that can be estimated with a small error. Simulation results show that the proposed method can detect nearby and/or reflective walls and improve the localization performance for the detected walls.
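To make the idea concrete, here is a minimal sketch, assuming illustrative layer sizes and a four-wall (shoebox) setting, of a convolutional recurrent network that maps a time-angle beamforming map to joint wall-detection probabilities and wall-distance estimates. This is not the authors' code; the branch structure, dimensions and loss weighting of the actual model differ.

```python
# Minimal sketch (not the authors' code): a convolutional recurrent network
# taking a (time x angle) beamforming map and jointly predicting, per wall,
# a detection probability and a distance estimate. All sizes are illustrative.
import torch
import torch.nn as nn

class ReflectorCRNN(nn.Module):
    def __init__(self, n_walls=4, hidden=64):
        super().__init__()
        # Convolutional front-end over the (time, angle) beamforming map
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
        )
        # Recurrent layer summarises the time axis
        self.gru = nn.GRU(input_size=32 * 16, hidden_size=hidden, batch_first=True)
        # Two branches: wall detection (probabilities) and wall distances (metres)
        self.detect = nn.Linear(hidden, n_walls)
        self.locate = nn.Linear(hidden, n_walls)

    def forward(self, x):                        # x: (batch, 1, time, 64)
        f = self.conv(x)                         # (batch, 32, time, 16)
        f = f.permute(0, 2, 1, 3).flatten(2)     # (batch, time, 32*16)
        _, h = self.gru(f)                       # final hidden state
        h = h.squeeze(0)                         # (batch, hidden)
        return torch.sigmoid(self.detect(h)), self.locate(h)

model = ReflectorCRNN()
maps = torch.randn(8, 1, 100, 64)                # batch of simulated maps
p_detect, distances = model(maps)
```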
Data-driven RGI methods generally rely on simulated data since the RIR measurements in a diverse set of rooms may be a prohibitively time-consuming and labor-intensive task. In the second part of the talk, we explore regularization methods to improve RGI accuracy when deep neural networks are trained with simulated data and tested with measured data. We use a smart speaker prototype equipped with multiple microphones and directional loudspeakers for real-world RIR measurements. The results indicate that applying dropout at the network’s input layer results in improved generalization compared to using it solely in the hidden layers. Moreover, RGI using multiple directional loudspeakers leads to increased estimation accuracy when compared to the single loudspeaker case, mitigating the impact of source directivity.
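The regularisation idea from the second part can be sketched in a few lines; the layer sizes and dropout rates below are placeholders, not the values used in the talk.

```python
# Sketch of the regularisation idea: applying dropout to the *input* features
# (not only the hidden layers) when training on simulated data, which the talk
# reports improves generalization to measured RIRs. Sizes are placeholders.
import torch
import torch.nn as nn

rgi_net = nn.Sequential(
    nn.Dropout(p=0.2),         # dropout at the input layer
    nn.Linear(512, 256), nn.ReLU(),
    nn.Dropout(p=0.5),         # conventional dropout in hidden layers
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 4),         # e.g. distances to four walls
)

preds = rgi_net(torch.randn(8, 512))   # batch of simulated RIR features
```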
Antoine Deleforge (INRIA - FR)
Summary: Estimating acoustic parameters, such as the localization of a sound source, the geometry, or the acoustical properties of an environment from audio recordings, is a crucial component of audio augmented reality systems. These tasks become especially challenging in the blind setting, e.g., when using noisy recordings of human speakers. Significant progress has been made in recent years thanks to the advent of supervised machine learning. However, these methods are often hindered by the limited availability of real-world annotated data for such tasks. A common strategy has been to use acoustic simulators to train such models, a framework we refer to as "Virtually Supervised Learning." In this talk, we will explore how the realism of simulation impacts the generalizability of virtually-supervised models to real-world data. We will focus on the tasks of sound source localization, room geometry estimation, and reverberation time estimation from noisy multichannel speech recordings. Our results suggest that enhancing the realism of the source, microphone, and wall responses during simulated training by making them frequency- and angle-dependent significantly improves generalization performance.
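As an illustration of this "virtually supervised" pipeline, the sketch below generates labelled training pairs (noisy reverberant speech, RT60) with the open-source pyroomacoustics simulator. The tool choice, room-parameter ranges and noise level are assumptions for the example; note that its simple frequency-flat wall absorption is exactly the kind of simplification whose impact on generalization the talk examines.

```python
# Minimal "virtually supervised" data generation sketch (illustrative only):
# each sample pairs a simulated noisy speech recording with its RT60 label.
import numpy as np
import pyroomacoustics as pra

fs = 16000
rng = np.random.default_rng(0)

def make_sample(speech):
    room_dim = rng.uniform([3, 3, 2.5], [10, 8, 4])       # random shoebox
    absorption = rng.uniform(0.1, 0.6)                    # frequency-flat walls
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(absorption), max_order=15)
    room.add_source(rng.uniform([0.5] * 3, room_dim - 0.5), signal=speech)
    room.add_microphone(rng.uniform([0.5] * 3, room_dim - 0.5))
    room.simulate()                                       # convolve with the RIR
    rt60 = pra.experimental.measure_rt60(room.rir[0][0], fs=fs)  # label
    noisy = room.mic_array.signals[0]
    noisy = noisy + 0.01 * rng.standard_normal(noisy.shape)      # sensor noise
    return noisy, rt60

speech = rng.standard_normal(fs)       # stand-in for a real speech excerpt
x, y = make_sample(speech)
```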
François Ollivier (IJLRA - Sorbonne Univ. FR)
Summary: This presentation covers the design, characteristics and implementation of a higher-order spherical microphone array (HOSMA) using 256 MEMS capsules. The HOSMA is designed for directional analysis of room acoustics at order 15. The array uses advanced techniques to capture spatial audio with high accuracy, enabling 3D acoustic analysis and sound field decomposition in the spherical harmonics (SH) domain. Design considerations include optimal microphone placement on the spherical surface, ensuring uniform spatial sampling and minimizing aliasing effects. The characteristics of the HOSMA are evaluated using simulations and real experiments. Implementation challenges, such as calibration and signal processing, are discussed. Applications in room acoustics, such as the estimation of directional room impulse responses (DRIRs) and sound source localization, are presented; they illustrate the HOSMA's potential in both research and practical scenarios. The first developments in a research project using the HOSMA for machine-learning-based DRIR interpolation are also presented.
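As a pointer to what the SH-domain decomposition involves, here is a minimal sketch of the encoding step: projecting the pressures captured on a spherical grid onto spherical harmonics by least squares. The grid and order below are toy values, not the HOSMA design (256 capsules, order 15).

```python
# Sketch of SH-domain encoding: project the pressure captured at M microphones
# on a sphere onto spherical harmonics up to a given order via a pseudo-inverse.
import numpy as np
from scipy.special import sph_harm

def sh_matrix(order, azimuth, colatitude):
    """Matrix Y: one row per microphone, one column per SH term (n, m)."""
    cols = [sph_harm(m, n, azimuth, colatitude)
            for n in range(order + 1) for m in range(-n, n + 1)]
    return np.stack(cols, axis=1)                 # (M, (order+1)**2)

M, order = 64, 5                                  # toy values, not the HOSMA
rng = np.random.default_rng(1)
az = rng.uniform(0, 2 * np.pi, M)                 # microphone azimuths
col = np.arccos(rng.uniform(-1, 1, M))            # colatitudes, ~uniform sphere
Y = sh_matrix(order, az, col)
p = rng.standard_normal(M)                        # one snapshot of mic pressures
coeffs = np.linalg.pinv(Y) @ p                    # SH coefficients of the field
```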
Annika Neidhardt (University of Surrey - UK)
Summary: The expectations and perceptual requirements depend strongly on the specific content and application. How can we make use of that? Is there a simple technical solution? This presentation will discuss different technical approaches that seem very promising.
Sebastian Schlecht (Friedrich-Alexander-Universität - D)
Summary: In spatial audio, accurately modelling sound field decay is critical for realistic 6DoF audio experiences. This talk introduces the common-slope model, a compact approach that utilizes an energetic sound field description to represent spatial energy decay smoothly and efficiently. We will explore the derivation of this model, demonstrating estimation techniques based on measured or simulated impulse responses (IRs). Particular focus will be given to applications in complex environments, such as coupled room systems, and unique phenomena like fade-in behaviour at the onset of reverberation. Additionally, we’ll discuss how common-slope parameters can be directly derived from room acoustic geometry using acoustic radiance transfer, offering insights into practical implementations in virtual and augmented reality audio.
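A toy sketch of the common-slope idea, under assumed decay times: the energy decay function (EDF) at every position is modelled as a nonnegative mixture of a few decay rates shared across the whole room, so fitting a new position reduces to estimating mixture amplitudes. The decay times and synthetic EDF below are illustrative only.

```python
# Toy common-slope fit: shared decay times T_k across the room, per-position
# nonnegative amplitudes estimated by NNLS. Values are illustrative only.
import numpy as np
from scipy.optimize import nnls

fs = 1000
t = np.arange(2 * fs) / fs                        # 2 s time axis
T = np.array([0.3, 1.2])                          # common decay times (s)
basis = np.exp(-13.8 * t[:, None] / T[None, :])   # exp(-13.8 t/T): -60 dB at T

def fit_common_slopes(edf):
    """Fit nonnegative amplitudes of the shared slopes to one position's EDF."""
    amps, _ = nnls(basis, edf)
    return amps

# Synthetic EDF for one position: mostly the short slope, a little late energy
edf = 1.0 * basis[:, 0] + 0.05 * basis[:, 1]
print(fit_common_slopes(edf))                     # ~[1.0, 0.05]
```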
Olivier Warusfel (IRCAM - FR)
This workshop is supported by the ANR and the French Ministry of Culture.