Recognizing human activities from low-resolution videos
Human activity recognition is one of the most intensively studied areas in computer vision. Most existing work does not treat video resolution as a problem, given the applications typically considered. However, with continuing concerns about global security and the emerging need for intelligent video analysis tools, activity recognition from low-resolution and low-quality videos has become a crucial topic for further research. In this dissertation, we present a series of approaches developed specifically to address low-level image preprocessing, single-person activity recognition, and human-vehicle interaction reasoning in low-resolution surveillance videos.
Human cast shadows are a major issue that adversely affects the performance of an activity recognition system, because the direction of a human shadow varies with the time of day and the day of the year. To address this problem, we propose a shadow removal technique that effectively eliminates a human shadow cast by a light source of unknown direction. A multi-cue shadow descriptor is employed to characterize the distinctive properties of shadows. Our approach detects, segments, and then removes shadows.
We propose two methods to recognize single-person actions and activities from low-resolution surveillance videos. The first approach adopts a representation based on joint feature histograms, formed by concatenating subspace-projected gradient and optical flow features in time. However, in this problem, the use of low-resolution, coarse, pixel-level features alone limits the recognition accuracy. Therefore, in the second work, we contribute a novel mid-level descriptor that converts an activity sequence into simultaneous temporal signals at body parts. With our representation, activities are recognized through both the local video content and the short-time spectral properties of body parts' movements. We draw analogies between activity and speech recognition and show that our speech-like representation and recognition scheme improves recognition performance on several low-resolution datasets.
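To make the speech analogy concrete, the following is a minimal sketch of a short-time spectral analysis of a single body-part motion signal. It is an illustration under stated assumptions, not the descriptor developed in the dissertation: the function name `short_time_spectrum`, the Hann window, the window and hop sizes, and the synthetic `motion` signal are all hypothetical choices.

```python
import cmath
import math

def short_time_spectrum(signal, win=8, hop=4):
    """Magnitude spectra of overlapping, Hann-windowed frames of a 1-D signal.

    Window type, win, and hop are illustrative assumptions.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (win - 1)) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(win)]
        # Naive DFT, keeping only the non-negative frequency bins.
        spectrum = []
        for k in range(win // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                      for n in range(win))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Hypothetical motion magnitude at one body part: a periodic movement
# that repeats every 4 video frames.
motion = [math.sin(2 * math.pi * t / 4) for t in range(32)]
spec = short_time_spectrum(motion)

# Index of the strongest frequency bin in each analysis frame.
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spec]
```

In this sketch, a movement that repeats every 4 frames concentrates its energy in one frequency bin of every analysis frame, which is the kind of short-time spectral cue a speech-like recognizer could use alongside local video content.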
To complete the research on this subject, we also tackle the challenging problem of recognizing human-vehicle interactions from low-resolution aerial videos. We present a temporal-logic-based approach that does not require training from event examples. At the low level, we employ dynamic programming to perform fast model fitting between the tracked vehicle and rendered 3-D vehicle models. At the semantic level, given the localized event region of interest (ROI), we verify the time series of human-vehicle spatial relationships against pre-specified event definitions in a piecewise fashion. Our framework can be generalized to recognize any type of human-vehicle interaction from aerial videos.
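The idea of checking a relationship time series against pre-specified event definitions, without any training examples, can be sketched as follows. This is a deliberately simplified ordered-stage check, not the temporal logic formalism of the dissertation: the relation labels, the event names in `EVENT_DEFINITIONS`, and the `min_len` noise threshold are all hypothetical.

```python
from itertools import groupby

# Hypothetical event definitions: each event is an ordered list of
# human-vehicle spatial relations that must hold one after another.
EVENT_DEFINITIONS = {
    "person_enters_vehicle": ["far", "near_door", "inside"],
    "person_exits_vehicle": ["inside", "near_door", "far"],
}

def segment(relations, min_len=2):
    """Collapse a per-frame relation series into piecewise-constant segments,
    dropping runs shorter than min_len frames as likely tracking noise."""
    runs = [(label, len(list(group))) for label, group in groupby(relations)]
    kept = [label for label, length in runs if length >= min_len]
    # Merge neighbours that became adjacent after noise removal.
    return [label for label, _ in groupby(kept)]

def satisfies(relations, definition, min_len=2):
    """True if the segments contain the definition's stages in order."""
    remaining = iter(segment(relations, min_len))
    # `stage in remaining` advances the iterator, which enforces ordering.
    return all(stage in remaining for stage in definition)

def recognize(relations):
    """Names of all defined events whose stages occur in the series."""
    return [name for name, definition in EVENT_DEFINITIONS.items()
            if satisfies(relations, definition)]

# Hypothetical per-frame relations inside an event ROI, with one noisy frame.
series = (["far"] * 5 + ["near_door"] * 3 + ["far"]
          + ["near_door"] * 2 + ["inside"] * 4)
events = recognize(series)
```

Because the definitions are written by hand rather than learned, adding a new interaction type in this sketch only requires a new entry in `EVENT_DEFINITIONS`.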