4D audio-visual learning: a visual perspective of sound propagation and production
dc.contributor.advisor | Grauman, Kristen Lorraine, 1979- | |
dc.contributor.committeeMember | David Harwath | |
dc.contributor.committeeMember | Dinesh Manocha | |
dc.contributor.committeeMember | Yuke Zhu | |
dc.contributor.committeeMember | Andrea Vedaldi | |
dc.creator | Chen, Changan 1995- | |
dc.date.accessioned | 2024-07-26T15:48:31Z | |
dc.date.available | 2024-07-26T15:48:31Z | |
dc.date.created | 2024-05 | |
dc.date.issued | 2024-05 | |
dc.date.submitted | May 2024 | |
dc.date.updated | 2024-07-26T15:48:31Z | |
dc.description.abstract | Humans use multiple modalities to perceive the world, including vision, sound, touch, and smell. Among them, vision and sound are two of the most important modalities that naturally co-occur. For example, we see and hear dogs barking, people having conversations, or cars honking on roads in our daily lives. Recent work has explored this natural correspondence between sight and sound, but it is mainly object-centric, i.e., it focuses on the semantic relations between objects and the sounds they make. While this work is exciting, it often overlooks the correspondence with the surrounding 3D space. For example, we hear the same sound differently in different environments or even at different locations in the same environment. In this thesis, I present 4D audio-visual learning, which learns the correspondence between sight and sounds in spaces, providing a visual perspective of sound propagation and sound production. More specifically, I focus on four topics in this direction: simulating sounds in spaces, navigating with sounds in spaces, synthesizing sounds in spaces, and learning action sounds. Throughout these topics, I use vision as the main bridge to connect audio and scene understanding. Below, I detail the work on each of these topics. Simulating sounds in spaces: Collecting visual-acoustic measurements is costly in the real world. To enable training of machine learning models, I begin by building a first-of-its-kind simulation platform named SoundSpaces. Given an arbitrary source sound, source/receiver locations, and the mesh of the 3D environment, SoundSpaces produces realistic audio renderings simulating how sounds propagate in space as a function of the 3D environment and the materials of different surfaces. Coupled with a modern visual rendering pipeline called Habitat, SoundSpaces produces 3D-consistent visual and audio renderings. It is also continuous, configurable, and generalizable to novel environments. 
This platform has unlocked many research opportunities, enabling multimodal embodied AI and beyond. Navigating with sounds in spaces: In robotics, navigating to localize a sound is an important application, for example, rescue robots searching for people or home service robots locating speech commands. However, existing robots mainly perceive the environment with vision sensors alone. To empower robots to see and hear, I introduce the audio-visual navigation task, where an embodied agent must navigate to the sounding object in an unknown environment by seeing and hearing. I train an end-to-end navigation policy based on reinforcement learning that predicts an action at every time step. This policy not only navigates to find the sounding object but also generalizes to unheard sounds and unseen environments. In a follow-up work, I introduce a hierarchical navigation policy that learns to set waypoints in an end-to-end fashion, which further improves the navigation efficiency of the previous work. I also investigate the semantic audio-visual navigation problem, where sounds always come from semantically meaningful and visible objects, and I show that my proposed policy can learn to associate how objects sound with how they look without explicit annotations. Lastly, I show that the policy trained in simulation can be transferred to the real world with frequency-adaptive prediction, and I demonstrate this on a physical robot platform. Synthesizing sounds in spaces: While it is important to study sight and sound in an embodied setting, isolating perception from decision-making is also valuable for applications in augmented reality or virtual reality, such as generating matching audio-visual streams for immersive experiences. I first propose the audio-visual dereverberation task, the goal of which is to remove reverberation from audio by utilizing visual cues. I show that the proposed model does well on downstream tasks such as speech recognition and speaker identification. 
In other applications, it is also desirable to add reverberation to audio to match the environment acoustics. I then investigate the inverse task: visual acoustic matching, where we transform audio to match the acoustics of a scene. Coupled with a self-supervised acoustic alteration strategy, the model learns to inject the proper amount of reverberation into the audio corresponding to the acoustics of the space. Lastly, to model the fine-grained acoustic changes within a scene, I propose the novel-view acoustic synthesis task, which requires the model to further reason about the nuanced change of audio in the same space at novel viewpoints. Learning action sounds: Vision not only provides cues about how sound propagates in a space as a function of the environment configuration but also captures how sounds are produced. Learning or generating sounds from silent videos is important for applications such as creating sound effects for films or virtual reality games. To understand how our physical activities produce sound, I propose to learn how human actions sound from narrated in-the-wild egocentric videos with a novel multimodal consensus embedding approach. I show that the model successfully discovers sounding actions from in-the-wild videos and learns embeddings for cross-modality retrieval. I then investigate how to generate temporally and semantically matching action sounds from silent videos. I propose a novel ambient-aware audio generation model that learns to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos, which also enables controllable generation of the ambient sound. Overall, my thesis covers promising directions in 4D audio-visual learning, that is, building fundamental simulation platforms, enabling multimodal embodied perception, providing faithful multimodal synthesis in 3D environments, and learning action sounds from in-the-wild videos. 
I show results on real videos and real-world environments, as well as simulation. In the last chapter of my thesis, I outline the potential research that remains to be explored in the future for 4D audio-visual learning. | |
dc.description.department | Computer Science | |
dc.format.mimetype | application/pdf | |
dc.identifier.uri | https://hdl.handle.net/2152/126173 | |
dc.identifier.uri | https://doi.org/10.26153/tsw/52710 | |
dc.language.iso | English | |
dc.subject | Audio-visual learning | |
dc.subject | Embodied AI | |
dc.subject | Egocentric videos | |
dc.subject | Acoustic simulation | |
dc.subject | Crossmodal generation | |
dc.title | 4D audio-visual learning: a visual perspective of sound propagation and production | |
dc.type | Thesis | |
dc.type.material | text | |
thesis.degree.college | College of Natural Sciences | |
thesis.degree.department | Computer Sciences | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | The University of Texas at Austin | |
thesis.degree.name | Doctor of Philosophy | |
thesis.degree.program | Doctoral Program | |