A hierarchical graphical model for recognizing human actions and interactions in video
Understanding human behavior in video data is essential in numerous applications, including smart surveillance, video annotation and retrieval, and human-computer interaction. Recognizing human interactions is challenging because of ambiguity in body articulation, mutual occlusion, and shadows. Past research has focused either on coarse-level recognition of human interactions or on recognition of a specific gesture of a single body part. Our objective is to develop methods that recognize human actions and interactions at a detailed level, and the focus of this research is a framework for doing so in color video. This dissertation presents a hierarchical graphical model that unifies multiple-level processing in video computing. The video, a color image sequence, is processed at four levels: pixel level, blob level, object level, and event level. A mixture-of-Gaussians (MOG) model is used at the pixel level to train and classify individual pixel colors. Relaxation labeling with an attribute relational graph (ARG) is used at the blob level to merge the pixels into coherent blobs and to register inter-blob relations. At the object level, the poses of individual body parts, including the head, torso, arms, and legs, are recognized using individual Bayesian networks (BNs), which are then integrated to obtain an overall body pose. At the event level, the actions of a single person are modeled using a dynamic Bayesian network (DBN) with temporal links between identical nodes of the Bayesian network at times t and t+1. Also at the event level, the object-level descriptions for each person are juxtaposed along a common timeline to identify an interaction between two persons. The linguistic ‘verb argument structure’ is used to represent human action in terms of <agent-motion-target> triplets. A decision tree applies spatial and temporal constraints to recognize specific interactions.
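To make the event-level step more concrete, the following is a minimal sketch of how spatial and temporal constraints might be combined in a small hand-written decision tree that emits an <agent-motion-target> triplet. It is an illustration under stated assumptions, not the dissertation's implementation: the `Observation` fields, the `near` and `fast` thresholds, the rule order, and the choice of person A as the agent are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triplet:
    """An <agent-motion-target> triplet."""
    agent: str
    motion: str
    target: str


@dataclass(frozen=True)
class Observation:
    """Hypothetical per-frame summary of two tracked persons, A and B."""
    time: float                    # seconds
    distance: float                # torso-to-torso distance in pixels
    a_hand_touches: Optional[str]  # body part of B touched by A's hand, if any
    a_arm_speed: float             # speed of A's arm in pixels per second


def classify(prev: Observation, curr: Observation,
             near: float = 80.0, fast: float = 300.0) -> Optional[Triplet]:
    """Apply illustrative spatial/temporal rules; thresholds are assumptions."""
    # Temporal constraint: change in distance between consecutive observations.
    closing = curr.distance - prev.distance  # negative when A and B get closer

    # Spatial constraint: contact only counts when the two persons are close.
    if curr.a_hand_touches is not None and curr.distance <= near:
        if curr.a_hand_touches == "hand":
            return Triplet("person A", "shake hands with", "person B")
        verb = "punch" if curr.a_arm_speed >= fast else "push"
        return Triplet("person A", verb, "person B")

    if closing < 0:
        return Triplet("person A", "approach", "person B")
    if closing > 0:
        return Triplet("person A", "depart from", "person B")
    return None  # no rule fired


if __name__ == "__main__":
    far = Observation(0.0, 200.0, None, 0.0)
    nearer = Observation(1.0, 150.0, None, 0.0)
    print(classify(far, nearer))
```

A learned decision tree would replace the hand-written branches, but the inputs (relative position, contact, and change over time) and the triplet output would keep the same shape.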
A meaningful semantic description in terms of <subject-verb-object> is obtained. Our method provides a user-friendly natural-language description of various human actions and interactions using event semantics. Our system correctly recognizes various human actions involving the motions of the torso, arms, and/or legs. It also achieves semantic descriptions of positive, neutral, and negative interactions between two persons: hand-shaking, standing hand-in-hand, and hugging as positive interactions; approaching, departing, and pointing as neutral interactions; and pushing, punching, and kicking as negative interactions.
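As an illustration of the <subject-verb-object> output, the sketch below renders a recognized interaction as a short sentence and tags it with its category. The nine interaction names and their positive/neutral/negative grouping come from the text above. The verb phrases, the `describe` function, and the sentence template are assumptions made for this example.

```python
# Interaction categories as listed in the abstract.
POLARITY = {
    "hand-shaking": "positive",
    "standing hand-in-hand": "positive",
    "hugging": "positive",
    "approaching": "neutral",
    "departing": "neutral",
    "pointing": "neutral",
    "pushing": "negative",
    "punching": "negative",
    "kicking": "negative",
}

# Hypothetical verb phrases used to fill the <subject-verb-object> template.
VERB_PHRASE = {
    "hand-shaking": "shakes hands with",
    "standing hand-in-hand": "stands hand-in-hand with",
    "hugging": "hugs",
    "approaching": "approaches",
    "departing": "departs from",
    "pointing": "points at",
    "pushing": "pushes",
    "punching": "punches",
    "kicking": "kicks",
}


def describe(subject: str, interaction: str, obj: str) -> str:
    """Render a recognized interaction as a natural-language sentence."""
    verb = VERB_PHRASE[interaction]
    return f"{subject} {verb} {obj} ({POLARITY[interaction]} interaction)."


if __name__ == "__main__":
    print(describe("Person A", "pushing", "Person B"))
    # prints: Person A pushes Person B (negative interaction).
```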