Browsing by Subject "Activity recognition"
Now showing 1 - 4 of 4
Item
Frugal Forests: learning a dynamic and cost sensitive feature extraction policy for anytime activity classification (2017-05)
Kelle, Joshua Allen; Grauman, Kristen Lorraine, 1979-
Many approaches to activity classification use supervised learning and so rely on extracting some form of features from the video. This feature extraction process can be computationally expensive. To reduce the cost of feature extraction while maintaining acceptable accuracy, we provide an anytime framework in which features are extracted one by one according to some policy. We propose our novel Frugal Forest feature extraction policy, which learns a dynamic and cost-sensitive ordering of the features. Cost sensitivity allows the policy to balance features' predictive power with their extraction cost. The tree-like structure of the forest allows the policy to adjust on the fly in response to previously extracted feature values. We show through several experiments that the Frugal Forest policy exceeds or matches the classification accuracy per unit time of several baselines, including the current state of the art, on two challenging datasets and a variety of feature spaces.
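The anytime setting sketched in this abstract can be made concrete with a short example. Below is a minimal Python sketch of an anytime, cost-sensitive feature-extraction loop; the policy, extractors, costs, and classifier are hypothetical placeholder interfaces introduced only for illustration, not the thesis's actual Frugal Forest implementation.

import time

def anytime_classify(video, extractors, costs, policy, classifier, budget_s):
    # Extract features one at a time until the time budget runs out,
    # returning the best-available prediction at any point.
    extracted = {}                          # feature name -> value
    prediction = classifier(extracted)      # prior prediction with no features
    start = time.time()
    while time.time() - start < budget_s:
        # A cost-sensitive, tree-structured policy would condition this choice
        # on the feature values already observed and on each feature's cost.
        name = policy.next_feature(extracted, costs)
        if name is None:                    # nothing useful left to extract
            break
        extracted[name] = extractors[name](video)
        prediction = classifier(extracted)  # refine the anytime answer
    return prediction

In this sketch, the policy's choice of the next feature is what a Frugal Forest would learn: a tree-structured, cost-aware ordering that adapts to the feature values extracted so far, so that stopping the loop at any time still yields a usable prediction.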
Item
Learning human activities and poses with interconnected data sources (2016-05)
Chen, Chao-Yeh; Grauman, Kristen Lorraine, 1979-; Aggarwal, Jake K.; Mooney, Raymond J.; Ramanan, Deva; Stone, Peter
Understanding human actions and poses in images or videos is a challenging problem in computer vision. There are different topics related to this problem, such as action recognition, pose estimation, human-object interaction, and activity detection. Knowledge of actions and poses could benefit many applications, including video search, surveillance, auto-tagging, event detection, and human-computer interfaces. To understand human actions and poses, we need to address several challenges. First, humans can perform an enormous number of poses. For example, simply to move forward, we can crawl, walk, run, or sprint. These poses all look different, and covering these variations requires many examples. Second, the appearance of a person's pose changes when viewed from different angles, so the learned action model needs to cover the variations across views. Third, many actions involve interactions between people and other objects, so we need to account for the appearance changes of those objects as well. Fourth, collecting such data for learning is difficult and expensive. Last, even if we can learn a good model for an action, localizing when and where the action happens in a long video remains difficult due to the large search space. My key idea for alleviating these obstacles is to discover the underlying patterns that connect the information from different data sources. Why should there be underlying patterns? The intuition is that all people share the same articulated physical structure. Though we can change our pose, there are common constraints that limit what our pose can be and how it can move over time. Therefore, all types of human data follow these rules, and the rules can serve as prior knowledge or regularization in our learning framework. If we can exploit these tendencies, we can extract additional information from the data and use it to improve the learning of human actions and poses.
In particular, we are able to find patterns for how our pose can vary over time, how our appearance looks in a specific view, what our pose is when we interact with objects with certain properties, and how parts of our body configuration are shared across different poses. If we can learn these patterns, they can be used to interconnect and extrapolate knowledge across different data sources. To this end, I propose several new ways to connect human activity data. First, I show how to connect snapshot images and videos by exploring the patterns of how our pose changes over time. Building on this idea, I explore how to connect human poses across multiple views by discovering the correlations between different poses and the latent factors that affect the viewpoint variations. In addition, I consider whether there are also patterns connecting our poses and nearby objects when we are interacting with them. Furthermore, I explore how the predicted interaction can be used as a cue to better address existing recognition problems, including image re-targeting and image description generation. Finally, after learning models that effectively incorporate these patterns, I propose a robust approach to efficiently localize when and where a complex action happens in a video sequence. Variants of the proposed approaches offer a good trade-off between computational cost and detection accuracy. My thesis exploits various types of underlying patterns in human data, and the discovered structure is used to enhance the understanding of human actions and poses. With the proposed methods, we are able to 1) learn an action from very few snapshots by connecting them to a pool of label-free videos, 2) infer the pose for some views even without any examples by connecting the latent factors between different views, 3) predict the location of an object that a person is interacting with, independent of the type and appearance of that object, and then use the inferred interaction as a cue to improve recognition, and 4) localize an action in a complex, long video. These approaches improve existing frameworks for understanding human actions and poses without extra data-collection cost and broaden the range of problems that we can tackle.
Item
On the motion and action prediction using deep graph models (2022-07-01)
Mohamed, Abduallah Adel Omar; Tewfik, Ahmed; Claudel, Christian; Bovik, Alan; Thomaz, Edison; Boyles, Stephen
Motion and action prediction is a crucial component of autonomous systems and robotics. It is vital for accident prediction and prevention, motion planning, surveillance systems, and behavior analysis. Typically, the problem involves observing the motion or actions of multiple agents across a span of time and then predicting their future motion and actions. Classical, model-based approaches fail in complex situations that require proper modeling of the interactions between agents. Data-driven approaches, specifically deep learning ones, perform better at modeling these interactions. Yet these deep models have relied on classical deep learning techniques, such as recurrent and convolutional architectures, which do not adequately represent the spatial and temporal structure of the observations. This dissertation presents novel approaches that use spatio-temporal graphs to model the observations: each agent is modeled as a graph node, and the agents' spatial configuration is encoded in the graph edges. The temporal relationship is represented by propagating the graph nodes across time. This representation captures spatial and temporal relationships more effectively than prior methods. Several deep architectures based on graph convolutional neural networks are also investigated, explored from different perspectives such as the kernel function of the graph edges and the embedding representation of the spatio-temporal relationships in both observations and predictions. The proposed models are also shown to have real-time capabilities in comparison with prior methods. Finally, the shortcomings of current evaluation metrics in assessing the quality of motion prediction models are investigated, and new metrics are proposed.
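As one way to make the graph construction described above concrete, the Python sketch below builds a per-time-step weighted adjacency matrix from agent trajectories, using an inverse-distance kernel on the edges and the symmetric normalization commonly applied before graph convolutions. The kernel choice, normalization, and array shapes are illustrative assumptions rather than the dissertation's exact formulation.

import numpy as np

def spatio_temporal_adjacency(traj, eps=1e-6):
    # traj: array of shape (T, N, 2) -- T time steps, N agents, (x, y) positions.
    # Returns A of shape (T, N, N): one weighted adjacency matrix per time step.
    T, N, _ = traj.shape
    A = np.zeros((T, N, N))
    for t in range(T):
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                d = np.linalg.norm(traj[t, i] - traj[t, j])
                A[t, i, j] = 1.0 / (d + eps)   # closer agents interact more strongly
        # symmetric normalization D^{-1/2} A D^{-1/2}, as in standard graph convolutions
        deg = A[t].sum(axis=1) + eps
        A[t] = A[t] / np.sqrt(np.outer(deg, deg))
    return A

A graph convolutional predictor would then consume these adjacency matrices together with per-node features at each time step and propagate the node representations forward in time.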
Item
Recognizing human activity using RGBD data (2014-05)
Xia, Lu, active 21st century; Aggarwal, J. K. (Jagdishkumar Keshoram), 1936-
Traditional computer vision algorithms try to understand the world using visible-light cameras, but this type of data source has inherent limitations. First, visible-light images are sensitive to illumination changes and background clutter. Second, the 3D structural information of the scene is lost when the 3D world is projected onto 2D images, and recovering 3D information from 2D images is a challenging problem. Range sensors, which capture the 3D characteristics of the scene, have existed for over thirty years, but earlier range sensors were either too expensive, difficult to use in human environments, slow at acquiring data, or poor at estimating distance. Recently, easy access to RGBD data at real-time frame rates has led to a revolution in perception and inspired much new research using RGBD data. I propose algorithms to detect persons and understand their activities using RGBD data, and I demonstrate that solutions to many computer vision problems may be improved with the added depth channel. The 3D structural information can give rise to real-time, view-invariant algorithms in a faster and easier fashion. When both data sources are available, features extracted from the depth channel may be combined with traditional features computed from the RGB channels to build more robust systems with enhanced recognition abilities that can deal with more challenging scenarios. As a starting point, the first problem is to find the persons in the scene, in various poses and whether moving or static. Localizing humans from RGB images is limited by lighting conditions and background clutter; depth images offer alternative ways to find the humans in the scene. In the past, detection of humans from range data was usually achieved by tracking, which does not work for indoor person detection. In this thesis, I propose a model-based approach that detects persons using the structural information embedded in the depth image: a 2D head contour model and a 3D head surface model are used to look for the head-shoulder part of the person, and a segmentation scheme is then proposed to segment the full human body from the background and extract its contour. I also give a tracking algorithm based on the detection result. I then investigate recognizing human actions and activities, and propose two features for this task. The first feature is drawn from the skeletal joint locations estimated from a depth image; it is a compact representation of the human posture called histograms of 3D joint locations (HOJ3D).
This representation is view-invariant, and the whole algorithm runs in real time, so the feature may benefit many applications that need a fast estimate of the posture and action of a human subject. The second feature is a spatio-temporal feature for depth video called the Depth Cuboid Similarity Feature (DCSF). Interest points are extracted using an algorithm that effectively suppresses noise and finds salient human motions, and a DCSF is extracted centered on each interest point, forming the description of the video contents. This descriptor can be used to recognize activities with no dependence on skeleton information or on pre-processing steps such as motion segmentation, tracking, or even image de-noising or hole-filling, which makes it more flexible and widely applicable to many scenarios. Finally, all the features developed herein are combined to solve a novel problem: first-person human activity recognition using RGBD data. Traditional activity recognition algorithms focus on recognizing activities from a third-person perspective; I propose to recognize activities from a first-person perspective with RGBD data. This task is novel and extremely challenging due to the large amount of camera motion caused either by self-exploration or by the response to interactions. I extract 3D optical flow features as motion descriptors, 3D skeletal joint features as posture descriptors, and spatio-temporal features as local appearance descriptors to describe the first-person videos. To address the ego-motion of the camera, I propose an attention mask that guides the recognition procedure and separates the features in the ego-motion region from those in the independent-motion region. The 3D features are very useful for summarizing the discriminative information of the activities. In addition, combining the 3D features with existing 2D features yields more robust recognition results and makes the algorithm capable of dealing with more challenging cases.
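To give the flavor of the HOJ3D descriptor described above, the Python sketch below bins the 3D joint locations of a single frame, expressed in spherical angles around a reference joint, into a normalized histogram. The choice of reference joint and bin counts is an illustrative assumption, and the thesis's full descriptor includes additional steps, including those needed for the view invariance claimed above, that are not shown here.

import numpy as np

def joint_location_histogram(joints, ref_idx=0, n_azimuth=12, n_elevation=6):
    # joints: array of shape (J, 3) holding 3D joint positions for one frame.
    # Returns a flattened, normalized histogram over angular bins.
    rel = joints - joints[ref_idx]                 # center on the reference joint
    rel = np.delete(rel, ref_idx, axis=0)          # drop the reference joint itself
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])     # angle in the x-y plane
    elevation = np.arctan2(rel[:, 2], np.linalg.norm(rel[:, :2], axis=1))
    hist, _, _ = np.histogram2d(
        azimuth, elevation,
        bins=[n_azimuth, n_elevation],
        range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
    return hist.flatten() / max(hist.sum(), 1)     # normalized posture descriptor

A sequence of such per-frame histograms could then serve as the compact posture representation on which a recognition pipeline of this kind builds.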