Look and listen : from semantic to spatial audio-visual perception

Gao, Ruohan

dc.contributor.advisor	Grauman, Kristen Lorraine, 1979-
dc.contributor.committeeMember	Zisserman, Andrew
dc.contributor.committeeMember	Mooney, Raymond
dc.contributor.committeeMember	Huang, Qixing
dc.creator	Gao, Ruohan
dc.date.accessioned	2021-07-27T17:14:59Z
dc.date.available	2021-07-27T17:14:59Z
dc.date.created	2021-05
dc.date.issued	2021-07-24
dc.date.submitted	May 2021
dc.date.updated	2021-07-27T17:14:59Z
dc.description.abstract	Understanding scenes and events is inherently a multi-modal experience. We perceive the world by both looking and listening (and touching, smelling, and tasting). In particular, the sounds made by objects, whether actively generated or incidentally emitted, offer valuable signals about their physical properties and spatial locations: the cymbals crash on stage, the bird tweets up in the tree, the truck revs down the block, the silverware clinks in the drawer. However, while recognition has made significant progress by "looking" (detecting objects, actions, or people based on their appearance), it often does not listen. In this thesis, I show that the audio that accompanies visual scenes and events can be used as a rich source of training signal for learning (audio-)visual models. In particular, I have developed computational models that leverage both the semantic and spatial signals in audio to understand people, places, and things from continuous multi-modal observations. Below, I summarize my key contributions along these two themes. Audio as a semantic signal: First, I develop methods that learn how different objects sound by both looking at and listening to unlabeled video containing multiple sounding objects. I propose an unsupervised approach to separate mixed audio into its component sound sources by disentangling the audio frequency bases for detected visual objects. Next, I propose a new approach that trains audio-visual source separation models on pairs of training videos. This co-separation framework permits both end-to-end training and learning object-level sounds from unlabeled videos of multiple sound sources. As an extension of the co-separation approach, I then study the classic cocktail party problem, separating voices from a speech mixture by leveraging the consistency between the speaker's facial appearance and their voice. The two modalities, vision and audition, are mutually beneficial. 
While visual objects are indicative of the sounds they make and can thus enhance audio source separation, audio can also be informative of the visual events in videos. Finally, I propose a framework that uses audio as a semantic signal to aid visual event classification. I design a preview mechanism that uses audio to eliminate both short-term and long-term visual redundancies for efficient action recognition in untrimmed video. Audio as a spatial signal: Both audio and visual data also convey significant spatial information. The two senses naturally work in concert to interpret spatial signals. In particular, the human auditory system uses two ears to extract individual sound sources from a complex mixture. Leveraging the spatial signal in videos, I devise an approach to lift a flat monaural audio signal to binaural audio by injecting the spatial cues embedded in the accompanying visual frames. When listening to the predicted binaural audio (the 2.5D visual sound), listeners can then feel the locations of the sound sources as they are displayed in the video. Beyond learning from passively captured video, I next explore the spatial signal in audio by deploying an agent to actively interact with the environment using audio. I propose a novel representation learning framework that learns useful visual features via echolocation by capturing echo responses in photo-realistic 3D indoor scene environments. Experimental results demonstrate that the image features learned from echoes are comparable to, or even outperform, heavily supervised pre-training methods for multiple fundamental spatial tasks: monocular depth prediction, surface normal estimation, and visual navigation. Our results serve as an exciting prompt for future work leveraging both the visual and audio modalities. 
Motivated by how we humans perceive and act in the world by making use of all our senses, the long-term goal of my research is to build systems that can perceive as well as we do by combining all multisensory inputs. In the last chapter of my thesis, I outline potential future research directions that I want to pursue beyond my Ph.D. dissertation.
dc.description.department	Computer Sciences
dc.format.mimetype	application/pdf
dc.identifier.uri	https://hdl.handle.net/2152/86943
dc.identifier.uri	http://dx.doi.org/10.26153/tsw/13893
dc.language.iso	en
dc.subject	Audio-visual
dc.subject	Multi-modal
dc.subject	Video
dc.subject	Semantic
dc.subject	Spatial
dc.subject	Source separation
dc.subject	Action recognition
dc.subject	Audio spatialization
dc.subject	Representation learning
dc.subject	Embodied learning
dc.title	Look and listen : from semantic to spatial audio-visual perception
dc.type	Thesis
dc.type.material	text
thesis.degree.department	Computer Sciences
thesis.degree.discipline	Computer Science
thesis.degree.grantor	The University of Texas at Austin
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy

Access full-text files

Original bundle

Name: GAO-DISSERTATION-2021.pdf
Size: 43.4 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.45 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.84 KB
Format: Plain Text

Collections

UT Electronic Theses and Dissertations