Artificial Intelligence Applied to Augmented Acoustic Scenes
Applications based on augmented acoustic reality are receiving attention in a broad range of fields, including artistic creation, cultural mediation, communication, and entertainment. Hearing is a key modality for understanding and interacting with our spatial environment, and it plays a major role in augmented reality applications. Embedding computer-generated or pre-recorded auditory content into a user's real acoustic environment creates an engaging and interactive experience that can be applied to video games, museum guides, or radio plays. The major challenge of audio processing for augmented reality lies in integrating these sound events without a perceptual gap, i.e. with a spatial rendering that constantly adapts to the acoustic conditions of the real environment, for example to the movements of the sound sources or of the listener.
The objective of the HAIKUS project is to jointly exploit machine learning and audio signal processing methods to solve acoustic problems encountered in augmented reality applications. Machine learning methods are applied to the automatic identification of the acoustic channels between the sources and the listener. Integrating virtual sounds into a real environment requires estimating the acoustic parameters of the room or site, so that the processing applied to the virtual sources can automatically adapt to the observed reverberant conditions. The challenge is therefore the blind estimation of the acoustic parameters (reverberation time, ratio of direct to reverberant sound) or of the room characteristics (volume, shape, wall absorption), based solely on the observation of the reverberant audio signals produced by the real sound sources present in the room. The listener's acceptance of the augmented acoustic scene depends on a realistic and congruent evolution of the acoustic cues with their own movement through the scene and with the movement of the virtual sources. This requires inferring plausible rules for modifying the spatialization parameters, or implementing room impulse response interpolation techniques, according to the relative movements of the sources and the listener.
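For reference, the sketch below shows how the two acoustic parameters mentioned above, reverberation time (via Schroeder backward integration, extrapolated from the -5 dB to -35 dB decay) and the direct-to-reverberant ratio, can be computed when a room impulse response is available; the project's goal is to estimate these quantities blindly, i.e. without access to the impulse response. This is an illustrative sketch, not the project's method; the function name and the 2.5 ms direct-sound window are assumptions.

```python
import numpy as np

def rt60_and_drr(rir, fs, direct_window_ms=2.5):
    """Compute RT60 (Schroeder backward integration, T30 extrapolation)
    and the direct-to-reverberant ratio (DRR) from a room impulse response."""
    rir = np.asarray(rir, dtype=float)

    # Schroeder energy decay curve, normalised to 0 dB at t = 0.
    edc = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)

    # Fit a line to the -5 dB .. -35 dB decay and extrapolate to -60 dB.
    t = np.arange(len(rir)) / fs
    mask = (edc_db <= -5.0) & (edc_db >= -35.0)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)
    rt60 = -60.0 / slope

    # DRR: energy in a short window around the direct peak vs. the remainder.
    peak = int(np.argmax(np.abs(rir)))
    half_win = int(direct_window_ms * 1e-3 * fs)
    direct = np.sum(rir[max(0, peak - half_win):peak + half_win] ** 2)
    reverberant = np.sum(rir[peak + half_win:] ** 2)
    drr_db = 10.0 * np.log10(direct / (reverberant + 1e-12))

    return rt60, drr_db
```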
Interactive virtual sound scenes are generally rendered in binaural audio over headphones. Convincing binaural rendering requires individual HRTFs, which are conventionally measured for each listener in an anechoic chamber with carefully calibrated audio equipment. We propose instead to estimate the listener's HRTFs blindly, using unsupervised methods, from binaural signals captured in real environments with sound sources and listener in motion.
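As a minimal sketch of the binaural rendering chain itself, the function below convolves a dry mono source signal with a pair of head-related impulse responses (HRIRs), optionally after applying a room impulse response so that the virtual source shares the reverberation of the real environment. In practice the HRIR pair would be selected and interpolated for the instantaneous direction of the source relative to the listener's head and updated as the listener moves; the function name and arguments here are illustrative assumptions, not the project's implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(dry_signal, hrir_left, hrir_right, rir=None):
    """Render a mono source for headphones by convolution with an HRIR pair.
    If a room impulse response is given, it is applied first so the virtual
    source carries the reverberation of the target room."""
    if rir is not None:
        dry_signal = fftconvolve(dry_signal, rir, mode="full")
    left = fftconvolve(dry_signal, hrir_left, mode="full")
    right = fftconvolve(dry_signal, hrir_right, mode="full")
    # Two-channel output: row 0 = left ear, row 1 = right ear.
    return np.stack([left, right], axis=0)
```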