Learning human activities and poses with interconnected data sources
Understanding human actions and poses in images or videos is a challenging problem in computer vision. Several topics relate to this problem, such as action recognition, pose estimation, human-object interaction, and activity detection. Knowledge of actions and poses could benefit many applications, including video search, surveillance, auto-tagging, event detection, and human-computer interfaces. To understand humans' actions and poses, we need to address several challenges. First, humans are able to perform an enormous number of poses. For example, simply to move forward, we can crawl, walk, run, or sprint. These poses all look different, and learning them requires examples that cover these variations. Second, the appearance of a person's pose changes when viewed from different angles. The learned action model needs to cover the variations across views. Third, many actions involve interactions between people and other objects, so we need to consider the appearance changes caused by those objects as well. Fourth, collecting such data for learning is difficult and expensive. Last, even if we can learn a good model for an action, localizing when and where the action happens in a long video remains a difficult problem due to the large search space. My key idea to alleviate these obstacles in learning humans' actions and poses is to discover the underlying patterns that connect the information from different data sources. Why should there be underlying patterns? The intuition is that all people share the same articulated physical structure. Though we can change our pose, there are common constraints that limit which poses we can take and how a pose can change over time. Therefore, all types of human data will follow these rules, and the rules can serve as prior knowledge or regularization in our learning framework.
If we can exploit these tendencies, we are able to extract additional information from data and use it to improve learning of humans' actions and poses. In particular, we are able to find patterns for how our pose can vary over time, how our appearance looks in a specific view, how we pose when interacting with objects that have certain properties, and how parts of our body configuration are shared across different poses. Once learned, these patterns can be used to interconnect and extrapolate knowledge between different data sources. To this end, I propose several new ways to connect human activity data. First, I show how to connect snapshot images and videos by exploring the patterns of how our pose can change over time. Building on this idea, I explore how to connect humans' poses across multiple views by discovering the correlations between different poses and the latent factors that affect viewpoint variations. In addition, I consider whether there are also patterns connecting our poses and nearby objects when we interact with them. Furthermore, I explore how the predicted interaction can be used as a cue to better address existing recognition problems, including image re-targeting and image description generation. Finally, after learning models that effectively incorporate these patterns, I propose a robust approach to efficiently localize when and where a complex action happens in a video sequence. The variants of my proposed approaches offer a good trade-off between computational cost and detection accuracy. My thesis exploits various types of underlying patterns in human data. The discovered structure is used to enhance the understanding of humans' actions and poses.
With my proposed methods, we are able to 1) learn an action from very few snapshots by connecting them to a pool of unlabeled videos, 2) infer the pose for some views even without any examples by connecting the latent factors between different views, 3) predict the location of an object that a person is interacting with, independent of the type and appearance of that object, then use the inferred interaction as a cue to improve recognition, and 4) localize an action in a long, complex video. These approaches improve existing frameworks for understanding humans' actions and poses without extra data collection cost, and they broaden the range of problems that we can tackle.