Towards scalable video understanding
Throughout our lives, we humans perceive the visual world, connect what we see over time, and make sense of the world around us. Today's computer vision systems observe the same visual world, but do not see it as we do. They parse only independent snapshots, without connecting them to form a complete understanding. What limits current computer vision systems? Historically, computer vision research has advanced through careful design choices made under computational constraints. For example, from the 1970s to the 1990s, many vision systems were built on detected edges rather than full images to save computation (e.g., [16, 68]). Similarly, in the 2000s, researchers built datasets of tiny (32x32) images. Many of these simplifications have been removed over time through advances in both hardware and algorithms. However, one long-lasting simplification remains popular: "instead of dealing with the full visual stream in a video, let's focus on one single image." While this simplification drastically reduces the computational cost, it also discards useful signals, such as motion, 3D structure, long-term context, and much more. This potentially makes computer vision unnecessarily hard and misses out on essential ingredients of our ultimate system. In this thesis, we aim to enable vision systems to operate efficiently on long videos, rather than just images, so that they can reason through time and understand the visual world around us more deeply. Towards this goal, in Part I, we study the scalability issues in multiple aspects of the status quo pipeline, ranging from recognition to compression and training, and propose methods to address them. We show how to leverage the temporal structure of videos, such as redundancy, to process videos efficiently without sacrificing performance. More importantly, in Part II, we verify that being able to model the long-term connections between visual signals over time is indeed advantageous.
In particular, we propose new models for long-form videos and demonstrate a significant performance gain over existing image-based or short-term video models. Furthermore, we show that enabling vision models to operate on long-form videos also enables understanding of the 'full picture' of a long visual stream. We show how to achieve this by analyzing the complex synergy between all objects and people in a long-form video. Finally, to facilitate future research, we introduce a new benchmark for long-form video understanding.