Browsing by Subject "Audio-visual learning"
Now showing 1 - 2 of 2
Item: 4D audio-visual learning: a visual perspective of sound propagation and production (2024-05)
Chen, Changan, 1995-; Grauman, Kristen Lorraine, 1979-; David Harwath; Dinesh Manocha; Yuke Zhu; Andrea Vedaldi

Humans use multiple modalities to perceive the world, including vision, sound, touch, and smell. Among them, vision and sound are two of the most important modalities, and they naturally co-occur: in daily life we see and hear dogs barking, people having conversations, or cars honking on the road. Recent work has explored this natural correspondence between sight and sound, but it is mainly object-centric, i.e., focused on the semantic relations between objects and the sounds they make. While exciting, this line of work often overlooks the correspondence with the surrounding 3D space: we hear the same sound differently in different environments, or even at different locations within the same environment. In this thesis, I present 4D audio-visual learning, which learns the correspondence between sight and sound in spaces, providing a visual perspective on sound propagation and sound production. More specifically, I focus on four topics in this direction: simulating sounds in spaces, navigating with sounds in spaces, synthesizing sounds in spaces, and learning action sounds. Throughout these topics, I use vision as the main bridge connecting audio and scene understanding. Below, I detail the work on each of these topics.

Simulating sounds in spaces: Collecting visual-acoustic measurements in the real world is costly. To enable machine learning models, I begin by building a first-of-its-kind simulation platform named SoundSpaces. Given an arbitrary source sound, source/receiver locations, and the mesh of a 3D environment, SoundSpaces produces realistic audio renderings that simulate how sound propagates in space as a function of the 3D environment and the materials of its surfaces. Coupled with a modern visual rendering pipeline, Habitat, SoundSpaces produces 3D-consistent visual and audio renderings. It is also continuous, configurable, and generalizable to novel environments. This platform has unlocked many research opportunities, enabling multimodal embodied AI and beyond.

Navigating with sounds in spaces: In robotics, navigating to localize a sound is an important application, for example, rescue robots searching for people or home service robots locating speech commands. However, existing robots mainly perceive the environment with vision sensors alone. To empower robots to see and hear, I introduce the audio-visual navigation task, in which an embodied agent must navigate to a sounding object in an unknown environment by seeing and hearing. I train an end-to-end navigation policy based on reinforcement learning that predicts an action at every time step; this policy not only navigates to the sounding object but also generalizes to unheard sounds and unseen environments. In follow-up work, I introduce a hierarchical navigation policy that learns to set waypoints in an end-to-end fashion, further improving navigation efficiency. I also investigate the semantic audio-visual navigation problem, where sounds always come from semantically meaningful and visible objects, and I show that my proposed policy learns to associate how objects sound with how they look without explicit annotations. Lastly, I show that the policy trained in simulation can be transferred to the real world with frequency-adaptive prediction, and I demonstrate this on a physical robot platform.
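As a rough illustration of what such an end-to-end audio-visual policy can look like, here is a minimal PyTorch-style sketch that encodes an RGB frame and a binaural spectrogram, keeps a recurrent memory, and outputs action logits plus a value estimate for actor-critic training. The observation shapes, network sizes, and four-action space are assumptions made for illustration and are not taken from the thesis.

```python
# Minimal sketch of an audio-visual navigation policy (illustrative only).
# Assumed inputs: 128x128 RGB frames and 2-channel (binaural) spectrograms of
# size 2x65x26; assumed discrete actions: stop, forward, turn left, turn right.
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    def __init__(self, num_actions: int = 4, hidden_dim: int = 512):
        super().__init__()
        # Separate CNN encoders for the RGB frame and the binaural spectrogram.
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, hidden_dim // 2), nn.ReLU(),
        )
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, hidden_dim // 2), nn.ReLU(),
        )
        # Recurrent core keeps a memory of past observations.
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        # Actor and critic heads for an actor-critic RL algorithm (e.g., PPO).
        self.actor = nn.Linear(hidden_dim, num_actions)
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, rgb, spectrogram, hidden):
        feat = torch.cat(
            [self.rgb_encoder(rgb), self.audio_encoder(spectrogram)], dim=-1
        )
        hidden = self.gru(feat, hidden)
        return self.actor(hidden), self.critic(hidden), hidden

# Usage: one decision step for a single agent.
policy = AudioVisualPolicy()
h = torch.zeros(1, 512)
logits, value, h = policy(torch.rand(1, 3, 128, 128), torch.rand(1, 2, 65, 26), h)
action = torch.distributions.Categorical(logits=logits).sample()
```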
Synthesizing sounds in spaces: While it is important to study sight and sound in an embodied setting, isolating perception from decision-making is also valuable for applications in augmented and virtual reality, such as generating matching audio-visual streams for immersive experiences. I first propose the audio-visual dereverberation task, whose goal is to remove reverberation from audio using visual cues, and I show that the proposed model performs well on downstream tasks such as speech recognition and speaker identification. In other applications, it is instead desirable to add reverberation to audio to match the environment acoustics, so I then investigate the inverse task, visual acoustic matching, in which we transform audio to match the acoustics of a scene. Coupled with a self-supervised acoustic alteration strategy, the model learns to inject the proper amount of reverberation into the audio to correspond to the acoustics of the space. Lastly, to model fine-grained acoustic changes within a scene, I propose the novel-view acoustic synthesis task, which requires the model to further reason about the nuanced changes in audio at novel viewpoints in the same space.
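These acoustic-matching tasks can be grounded in a classical signal-processing identity: imposing a room's acoustics on a dry signal amounts to convolving it with that room's impulse response (RIR), an effect the learned models described above approximate directly from visual input. The following sketch shows only this classical operation, not the thesis's models; the synthetic exponentially decaying RIR, its decay constant, and the sampling rate are arbitrary illustrative choices.

```python
# Illustrative sketch: add a room's reverberation to a dry signal by
# convolving it with a room impulse response (RIR).
import numpy as np
from scipy.signal import fftconvolve

def apply_room_acoustics(dry_audio: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) waveform with a room impulse response."""
    wet = fftconvolve(dry_audio, rir, mode="full")[: len(dry_audio)]
    peak = np.max(np.abs(wet))
    # Normalize to avoid clipping after convolution.
    return wet / peak if peak > 0 else wet

# Toy example with a synthetic decaying RIR as a stand-in for a measured or
# simulated one (e.g., from a SoundSpaces-style renderer).
sr = 16000
dry = np.random.randn(sr)                          # 1 s of placeholder "dry" audio
t = np.arange(int(0.5 * sr)) / sr                  # 0.5 s impulse response
rir = np.random.randn(len(t)) * np.exp(-t / 0.1)   # decay constant ~0.1 s
rir[0] = 1.0                                       # direct-path arrival
wet = apply_room_acoustics(dry, rir)
```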
Learning action sounds: Vision not only provides cues about how sound propagates in a space as a function of the environment configuration but also captures how sounds are produced. Learning or generating sounds from silent videos is important for applications such as creating sound effects for films or virtual reality games. To understand how our physical activities produce sound, I propose to learn how human actions sound from narrated in-the-wild egocentric videos with a novel multimodal consensus embedding approach, and I show that the model successfully discovers sounding actions from in-the-wild videos and learns embeddings for cross-modal retrieval. I then investigate how to generate temporally and semantically matching action sounds from silent videos, proposing a novel ambient-aware audio generation model that learns to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos, which also enables controllable generation of the ambient sound.

Overall, my thesis covers promising directions in 4D audio-visual learning: building fundamental simulation platforms, enabling multimodal embodied perception, providing faithful multimodal synthesis in 3D environments, and learning action sounds from in-the-wild videos. I show results on real videos and in real-world environments as well as in simulation. In the last chapter of the thesis, I outline the research that remains to be explored in 4D audio-visual learning.

Item: From active to passive spatial acoustic sensing and applications (2022-12-21)
Sun, Wei (Ph. D. in computer science); Qiu, Lili, Ph. D.; Mok, Aloysius K.; Harwath, David; Yun, Sangki

Active acoustic sensing systems emit modulated acoustic waves and analyze the reflected signals; they dominate acoustic spatial sensing. Passive acoustic sensing systems, by contrast, receive and analyze natural sounds directly; they are good at semantic tasks but perform poorly on spatial sensing. In this dissertation, we bridge three gaps in existing systems: the gap between the assumptions of signal processing algorithms and real acoustic environments, the gap between powerful active spatial sensing and limited passive spatial sensing, and the gap between semantic features and spatial information. We evolve the design of acoustic sensing systems and extend their functionality with three novel systems.

First, we develop a fully active spatial sensing system, DeepRange, which adapts easily to real environments. We develop an effective mechanism to generate synthetic training data that captures noise, speaker/microphone distortion, and interference in the signals, removing the need to collect a large volume of real data. We then design a deep range neural network (DRNet) to estimate distance from raw acoustic signals; it is inspired by the signal-processing insight that an ultra-long convolution kernel helps combat noise and interference. The model is trained entirely on synthetic data, yet it robustly achieves sub-centimeter error on real data across varied environments, background noise, interference, and mobile phone models.

Second, we develop a fused active and passive spatial sensing system for speech separation, termed Spatial Aware Multi-task learning-based Separation (SAMS). We leverage both active and passive sensing to improve AoA estimation and to jointly optimize the semantic and spatial tasks: SAMS simultaneously estimates the spatial location of, and extracts the speech of, the target user during teleconferencing. We first generate fine-grained spatial embeddings from the user's voice and an inaudible tracking sound, which capture the user's position and rich multipath information. We then develop a deep neural network with multi-task learning to jointly optimize source separation and localization, and we significantly speed up inference to provide a real-time guarantee.

Finally, we deeply fuse semantic features and spatial cues to combat interference and noise in real environments and to enable depth sensing in a fully passive setup. Inspired by the "flash-to-bang" phenomenon (i.e., hearing thunder after seeing lightning), we propose FBDepth to measure the depth of a sound source, formulating the problem as an audio-visual event localization task for collision events. Specifically, FBDepth first aligns the video track with the audio track to locate the target object and target sound at a coarse granularity. Based on the trajectories of moving objects, it then estimates the intersection of optical flow before and after the collision to localize the video event in time, and it feeds the estimated timestamp of the video event, together with the other modalities, into the final depth estimation. We use a mobile phone to collect 3.6K+ video clips involving 24 different objects at ranges up to 60 m. FBDepth shows superior performance compared to monocular and stereo methods, especially at long range.
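The flash-to-bang cue admits a simple worked estimate: light arrives effectively instantaneously over tens of meters, so the source depth is approximately the speed of sound multiplied by the delay between the visually observed collision and the arrival of its sound. The sketch below shows only this back-of-the-envelope calculation, assuming 343 m/s in air; the actual FBDepth pipeline fuses the estimated timestamps with the other modalities in a learned depth estimator.

```python
# Back-of-the-envelope "flash-to-bang" depth estimate (illustrative only).
# Assumes the visual event timestamp and the audio onset timestamp have been
# localized by upstream video/audio analysis, as described in the abstract.
SPEED_OF_SOUND_M_S = 343.0  # in air at roughly 20 degrees C

def flash_to_bang_depth(video_event_time_s: float, audio_onset_time_s: float) -> float:
    """Estimate source depth from the audio-visual time offset.

    Light travel time is negligible over tens of meters, so the offset is
    dominated by the acoustic propagation delay.
    """
    delay = audio_onset_time_s - video_event_time_s
    if delay < 0:
        raise ValueError("Audio onset should not precede the visual event.")
    return SPEED_OF_SOUND_M_S * delay

# Example: a collision seen at t = 2.000 s whose sound arrives at t = 2.175 s
# implies a range of roughly 343 * 0.175 = 60 m.
print(flash_to_bang_depth(2.000, 2.175))  # ~60.0 m
```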