Browsing by Subject "Representation learning"
Now showing 1 - 5 of 5
Item: DexV2A : vision pretraining for dexterous manipulation (2022-05-06)
Stoken, Alex Harrison; Grauman, Kristen Lorraine, 1979-
Achieving human-like dexterity for common daily tasks with a robotic hand is challenging for reinforcement learning. To combat the high dimensionality of the state-action space, we propose pre-training policies on low-level vision tasks. Our system, called DexV2A, first trains a neural network on structural vision tasks like edge detection, center point estimation, and surface normal estimation to embed useful visual features into the network weights. When this network is transferred to a dexterous manipulation policy, it offers an advantageous initialization for task learning. We conduct experiments on four diverse manipulation tasks with a 30-DoF dexterous robotic hand in simulation. We show that for the tasks of opening, closing, pouring, and stirring, DexV2A improves policy learning over policies trained without any visual pre-training. Our experimental results demonstrate the effectiveness of our approach and emphasize the potency of visual pre-training over learning via direct experience.
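To make the two-stage recipe the abstract describes concrete, here is a minimal, hypothetical sketch: pretrain a small encoder on a structural vision task (edge detection), then reuse its weights to initialize the visual front-end of a manipulation policy. All names, layer sizes, and the stand-in data are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Small conv encoder shared by pretraining and the policy (illustrative)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

# Stage 1: pretrain encoder + per-pixel head on edge detection.
encoder = VisualEncoder()
edge_head = nn.Conv2d(64, 1, 1)  # predicts an edge map at reduced resolution
opt = torch.optim.Adam(list(encoder.parameters()) + list(edge_head.parameters()), lr=1e-3)
images = torch.randn(8, 3, 64, 64)        # stand-in image batch
edge_targets = torch.rand(8, 1, 16, 16)   # stand-in edge labels
loss = nn.functional.binary_cross_entropy_with_logits(edge_head(encoder(images)), edge_targets)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: the pretrained weights initialize the policy's visual front-end,
# which is then fine-tuned with RL (the 30-dim output stands in for a 30-DoF hand).
policy = nn.Sequential(encoder, nn.Flatten(), nn.Linear(64 * 16 * 16, 30))
actions = policy(images)
```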
Item: Look and listen : from semantic to spatial audio-visual perception (2021-07-24)
Gao, Ruohan; Grauman, Kristen Lorraine, 1979-; Zisserman, Andrew; Mooney, Raymond; Huang, Qixing
Understanding scenes and events is inherently a multi-modal experience. We perceive the world by both looking and listening (and touching, smelling, and tasting). In particular, the sounds made by objects, whether actively generated or incidentally emitted, offer valuable signals about their physical properties and spatial locations: the cymbals crash on stage, the bird tweets up in the tree, the truck revs down the block, the silverware clinks in the drawer. However, while recognition has made significant progress by "looking" (detecting objects, actions, or people based on their appearance), it often does not listen. In this thesis, I show that the audio accompanying visual scenes and events can be used as a rich source of training signal for learning (audio-)visual models. In particular, I have developed computational models that leverage both the semantic and spatial signals in audio to understand people, places, and things from continuous multi-modal observations. Below, I summarize my key contributions along these two themes.

Audio as a semantic signal: First, I develop methods that learn how different objects sound by both looking at and listening to unlabeled video containing multiple sounding objects. I propose an unsupervised approach to separate mixed audio into its component sound sources by disentangling the audio frequency bases for detected visual objects. Next, I propose a new approach that trains audio-visual source separation models on pairs of training videos. This co-separation framework permits both end-to-end training and learning object-level sounds from unlabeled videos of multiple sound sources. As an extension of the co-separation approach, I then study the classic cocktail party problem, separating voices from a speech mixture by leveraging the consistency between a speaker's facial appearance and their voice. The two modalities, vision and audition, are mutually beneficial: while visual objects are indicative of the sounds they make and thereby enhance audio source separation, audio can also be informative of the visual events in videos. Finally, I propose a framework that uses audio as a semantic signal to help classify visual events. I design a preview mechanism that uses audio to eliminate both short-term and long-term visual redundancies for efficient action recognition in untrimmed video.

Audio as a spatial signal: Both audio and visual data also convey significant spatial information, and the two senses naturally work in concert to interpret spatial signals. In particular, the human auditory system uses two ears to extract individual sound sources from a complex mixture. Leveraging the spatial signal in videos, I devise an approach to lift a flat monaural audio signal to binaural audio by injecting the spatial cues embedded in the accompanying visual frames. When listening to the predicted binaural audio (the 2.5D visual sound), listeners can then feel the locations of the sound sources as they are displayed in the video. Beyond learning from passively captured video, I next explore the spatial signal in audio by deploying an agent that actively interacts with the environment using audio. I propose a novel representation learning framework that learns useful visual features via echolocation, capturing echo responses in photo-realistic 3D indoor scene environments. Experimental results demonstrate that the image features learned from echoes are comparable to, or even outperform, heavily supervised pre-training methods on multiple fundamental spatial tasks: monocular depth prediction, surface normal estimation, and visual navigation. Our results serve as an exciting prompt for future work leveraging both the visual and audio modalities. Motivated by how we humans perceive and act in the world by making use of all our senses, the long-term goal of my research is to build systems that can perceive as well as we do by combining all the multisensory inputs. In the last chapter of my thesis, I outline potential future research directions that I want to pursue beyond my Ph.D. dissertation.
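As an illustration of the mono-to-binaural idea ("2.5D visual sound"), here is a minimal, hypothetical sketch: a network conditions the mono spectrogram on a visual feature vector and predicts the left-right difference channel, from which the two ears are recovered. The module names, layer sizes, and fusion scheme are illustrative assumptions, not the dissertation's architecture.

```python
import torch
import torch.nn as nn

class Mono2Binaural(nn.Module):
    def __init__(self, freq_bins=256, visual_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv1d(freq_bins, 256, kernel_size=3, padding=1)
        self.fuse = nn.Linear(visual_dim, 256)
        self.dec = nn.Conv1d(256, freq_bins, kernel_size=3, padding=1)

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (batch, freq_bins, time); visual_feat: (batch, visual_dim)
        h = torch.relu(self.audio_enc(mono_spec))
        h = h + self.fuse(visual_feat).unsqueeze(-1)  # inject spatial cues from vision
        diff = self.dec(h)                            # predicted (left - right) difference
        left = mono_spec + diff                       # recover the two channels from
        right = mono_spec - diff                      # the mono mixture and the difference
        return left, right

model = Mono2Binaural()
left, right = model(torch.randn(4, 256, 100), torch.randn(4, 512))
```

In training, the target difference spectrogram would come from real binaural recordings paired with video frames; the model here only shows the conditioning pattern.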
Item: Novel approaches for learning representations (2022-12-01)
Lotfi Rezaabad, Ali; Tamir, Jon (Jonathan I.); Vishwanath, Sriram; Kim, Hyeji; Thomaz, Edison; Williamson, Sinead
Newly developed machine learning algorithms are heavily dependent on the choice of data representation, so the success of these methods generally relies on the quality and usefulness of that representation. For this reason, a significant body of deep learning research is dedicated to discovering useful representations. As a result, there has been a surge of interest in unsupervised representation learning, which eliminates the requirement for a time-consuming and expensive labeling procedure. The emergence of variational autoencoders (VAEs) has opened a new avenue for unsupervised learning methods. While these methods are elegant in their approach, they are typically not effective for representation learning. One chapter of this dissertation proposes a simple and powerful framework based on VAEs that acts as an encoder/decoder and concurrently helps us discover more meaningful representations. My solution addresses this issue with information-theoretic tools that promote amortized inference in VAEs. I call this approach InfoMax-VAE, as it solves the problem by maximizing mutual information. I also demonstrate that such an approach can significantly boost the quality of the learned high-level representations. Many interesting applications in the physical, social, and information sciences can be modeled with relational data. In another chapter of this dissertation, I build on semi-implicit graph variational autoencoders to capture higher-order statistics in the latent features of a graph dataset. I address this by incorporating hyperbolic geometry in the latent space via a Poincaré embedding, to efficiently represent graphs that exhibit hierarchical structure. To relax the naive posterior latent distribution assumptions of variational inference, I utilize semi-implicit hierarchical variational Bayes to implicitly capture the posteriors of the given graph data, which may exhibit multiple modes, skewness, and highly correlated latent structure. Due to data bias and domain shift, the data used for training a neural network and the data used for inference tend not to follow the same distribution. Unfortunately, it has been shown that the performance of neural networks is sensitive to even slight differences between the source and target distributions. Although a variety of complex algorithms have lately been developed for this issue in classification, the analogous challenge in unsupervised representation learning has seen very little investigation. I am particularly interested in the case where only a limited number of samples are available from the target domain due to the high cost of data collection. Therefore, to solve the problem of few-shot domain adaptation in unsupervised contrastive learning, I provide a novel method that requires neither samples from the source domain nor labels from the target domain.

Item: Unsupervised contrastive representation learning : a survey (2022-05-06)
Ball, Kelsey; Sanghavi, Sujay Rajendra, 1979-; Mokhtari, Aryan
Unsupervised contrastive representation learning uses unlabeled data to learn a feature space in which similar inputs are closer together (in Euclidean distance) than dissimilar ones. An ideal feature space encodes the relevant features of the input space, reducing the amount of labeled data needed for classification. In this paper, we survey theoretical and applied results for image and text representation learning that use unsupervised contrastive methods.
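For readers unfamiliar with the contrastive objective underlying the two items above, here is a generic InfoNCE-style loss of the kind such surveys cover (a common textbook formulation, not any single paper's): embeddings of two augmented views of the same input are pulled together, while all other items in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same inputs."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```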
Item: Unsupervised learning for large-scale data (2019-09-20)
Wu, Shanshan, Ph. D.; Sanghavi, Sujay Rajendra, 1979-; Dimakis, Alexandros G.; Caramanis, Constantine; Klivans, Adam R; Ward, Rachel A
Unsupervised learning involves inferring the inherent structures or patterns in unlabeled data. Since there is no label information, the fundamental challenge of unsupervised learning is that the objective function is not explicitly defined. The ubiquity of large-scale datasets adds another layer of complexity to the overall learning problem: when the data size or dimension is large, even algorithms with quadratic runtime may be prohibitively expensive. This thesis presents four large-scale unsupervised learning problems. We start with two density estimation problems: given samples from a one-layer ReLU generative model or a discrete pairwise graphical model, the goal is to recover the parameters of the generative model. We then move to representation learning for high-dimensional sparse data arising from one-hot encoded categorical features. We assume that there is additional but a priori unknown structure in their support; the goal is to learn a lossless low-dimensional embedding for the given data. Our last problem is to compute low-rank approximations of a matrix product given the individual matrices, in the setting where the matrices are too large to fit in memory and can only be stored on disk. For every problem presented in this thesis, we (i) design novel and efficient algorithms to capture the inherent structure of the data in an unsupervised manner; (ii) establish theoretical guarantees and compare empirical performance with state-of-the-art methods; and (iii) provide source code to support our experimental findings.
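To illustrate the last problem, here is a hedged sketch of one standard way to approximate A @ B at low rank without ever materializing the full product: a randomized range finder in the style of Halko et al. This is an illustrative method in the spirit of the thesis problem, not necessarily the algorithm it develops; in the out-of-core setting, each of the three products below can be computed by streaming A and B from disk in chunks.

```python
import numpy as np

def lowrank_product(A, B, k, oversample=10):
    """Rank-(k+oversample) approximation of A @ B, returned in factored form (Q, C)."""
    rng = np.random.default_rng(0)
    G = rng.standard_normal((B.shape[1], k + oversample))
    Y = A @ (B @ G)            # sketch the range of A @ B without forming it
    Q, _ = np.linalg.qr(Y)     # orthonormal basis for that range
    C = (Q.T @ A) @ B          # small (k+oversample) x p projection
    return Q, C                # A @ B is approximated by Q @ C

A, B = np.random.randn(500, 200), np.random.randn(200, 300)
Q, C = lowrank_product(A, B, k=20)
rel_err = np.linalg.norm(A @ B - Q @ C) / np.linalg.norm(A @ B)
```

The key design point is that the expensive n-dimensional inner products never require the m x p product itself: only the narrow sketch Y and the short-fat factor C are ever held in memory.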