Learning to compose photos and videos from passive cameras

Date

2019-09-16

Authors

Xiong, Bo

Abstract

Photo and video overload is well known to most computer users. With cameras on mobile devices, it is all too easy to snap images and videos spontaneously, yet it remains much harder to organize or search through that content later. With increasingly portable wearable and 360° computing platforms, the overload problem is only intensifying. Wearable and 360° cameras passively record everything they observe, unlike traditional cameras that require active human attention to capture images or videos. In my thesis, I explore the idea of automatically composing photos and videos from unedited videos captured by "passive" cameras. Passive cameras (e.g., wearable cameras, 360° cameras) offer a more relaxed way to record our visual world, but they do not always capture frames that look like intentional human-taken photos. In wearable cameras, many frames are blurry, poorly composed, or simply uninteresting. In 360° cameras, a single omnidirectional image captures the entire visual world, and the photographer's intention and attention in that moment are unknown. To this end, I consider the following problems in the context of passive cameras: 1) what visual data to capture and store, 2) how to identify foreground objects, and 3) how to enhance the viewing experience.

First, I explore the problem of finding the best moments in unedited videos. Not everything observed in a wearable camera's video stream is worth capturing and storing. People can easily distinguish well-composed moments from a wearable camera's accidental shots, which prompts the question: can a vision system predict the best moments in unedited video? I first study how to find the best moments at the level of short video clips. My key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, I introduce a novel ranking framework to learn video highlight detection from unlabeled videos. Next, I show how to predict snap points in unedited video, that is, those frames that look like intentionally taken photos. I propose a framework to detect snap points that requires no human annotations. The main idea is to construct a generative model of what human-taken photos look like by sampling images posted on the Web. Snapshots that people upload to share publicly online vary vastly in their content, yet all share the key facet that they were intentional snap point moments. This makes them an ideal source of positive exemplars for the target learning problem. In both settings, despite learning without any explicit labels, my proposed models outperform discriminative baselines trained with labeled data.

Next, I introduce a novel approach to automatically segment foreground objects in images and videos. Identifying key objects is an important intermediate step for automatic photo composition, and it is also a prerequisite in graphics applications like image retargeting, production video editing, and rotoscoping. Given an image or video frame, the goal is to determine the likelihood that each pixel is part of a foreground object. I formulate the task as a structured prediction problem of assigning an object/background label to each pixel (pixel objectness), and I propose an end-to-end trainable model that draws on the respective strengths of generic object appearance and motion in a unified framework.
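To make the appearance-and-motion idea concrete, the following is a minimal sketch, not the thesis architecture, of a two-stream network that scores each pixel as object or background from an RGB frame and its optical flow. The layer sizes and the TwoStreamObjectness module are illustrative assumptions only.

```python
# Minimal two-stream sketch (illustrative, not the exact thesis architecture):
# an appearance stream and a motion stream each produce per-pixel
# object/background logits, and a 1x1 convolution fuses the two score maps.
import torch
import torch.nn as nn

class TwoStreamObjectness(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder encoders; the real model would use deep fully
        # convolutional networks pretrained for object recognition.
        self.appearance = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, kernel_size=1),
        )
        self.motion = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),  # optical flow (dx, dy)
            nn.Conv2d(16, 2, kernel_size=1),
        )
        self.fuse = nn.Conv2d(4, 2, kernel_size=1)  # merge the two 2-channel score maps

    def forward(self, rgb, flow):
        a = self.appearance(rgb)   # (B, 2, H, W) appearance logits
        m = self.motion(flow)      # (B, 2, H, W) motion logits
        return self.fuse(torch.cat([a, m], dim=1))  # fused per-pixel logits

# Training reduces to per-pixel cross-entropy against binary masks:
# logits = TwoStreamObjectness()(rgb, flow)   # rgb: (B, 3, H, W), flow: (B, 2, H, W)
# loss = nn.CrossEntropyLoss()(logits, mask)  # mask: (B, H, W) with values in {0, 1}
```

Fusing the streams at the score level, as in this toy example, is one simple way to let either cue dominate when the other is unreliable (e.g., static objects with no motion, or moving backgrounds).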
Since large-scale video datasets with pixel-level segmentations are scarce, I show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. In addition, I demonstrate how the proposed approach benefits image retrieval and image retargeting. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of foreground objects.

Building on the proposed foreground segmentation method, I finally explore how to predict viewing angles that enhance photo composition once those foreground objects are identified. Specifically, I introduce snap angle prediction for 360° panoramas, which are a rich medium yet notoriously difficult to visualize in the 2D image plane. I explore how intelligent rotations of a spherical image may enable content-aware projection with fewer perceptible distortions. Whereas existing approaches assume the viewpoint is fixed, intuitively some viewing angles within the sphere preserve high-level objects better than others. To discover the relationship between these optimal snap angles and the spherical panorama's content, I develop a reinforcement learning approach for the cubemap projection model. Implemented as a deep recurrent neural network, our method selects a sequence of rotation actions and receives reward for avoiding cube boundaries that overlap with important foreground objects, as sketched below the abstract. Our proposed method offers a 5x speedup over exhaustive search.

Throughout, I validate the strength of the proposed frameworks on multiple challenging datasets against a variety of previously established state-of-the-art methods and other pertinent baselines. Our experiments demonstrate the following: 1) our method can automatically identify the best moments in unedited videos; 2) our segmentation method substantially improves the state of the art on foreground segmentation in images and videos and also benefits automatic photo composition; and 3) our viewing angle prediction for 360° imagery can enhance the viewing experience. Although my thesis focuses mainly on passive cameras, a portion of the proposed methods is also applicable to general user-generated images and videos.
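The snap angle step frames projection as a sequential decision process. Below is a minimal sketch of how such a recurrent policy could be trained with a simple REINFORCE objective, under assumed helpers: encode_panorama (embeds the rotated panorama into a fixed-size feature) and boundary_overlap (scores how much detected foreground falls on cube-face boundaries). The action set, step sizes, and dimensions are illustrative, not the thesis's exact formulation.

```python
# Minimal REINFORCE-style sketch of sequential rotation selection (illustrative only).
# `encode_panorama(pano, angle)` -> (1, FEAT_DIM) feature and `boundary_overlap(pano, angle)`
# -> scalar overlap score are assumed, user-supplied helpers.
import torch
import torch.nn as nn

N_ACTIONS, T_STEPS, FEAT_DIM = 4, 5, 128   # assumed action set / rotation budget / feature size

class SnapAnglePolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(FEAT_DIM, FEAT_DIM)   # recurrent state over the action sequence
        self.head = nn.Linear(FEAT_DIM, N_ACTIONS)  # logits over discrete rotation steps

    def forward(self, feat, hidden):
        hidden = self.rnn(feat, hidden)
        dist = torch.distributions.Categorical(logits=self.head(hidden))
        return dist, hidden

def rollout_loss(policy, pano, encode_panorama, boundary_overlap):
    """Sample a rotation sequence and return the policy-gradient loss."""
    angle, hidden, log_probs = 0.0, torch.zeros(1, FEAT_DIM), []
    for _ in range(T_STEPS):
        dist, hidden = policy(encode_panorama(pano, angle), hidden)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        angle += (action.item() - N_ACTIONS // 2) * 10.0  # map discrete action to degrees (assumed)
    reward = -boundary_overlap(pano, angle)    # less foreground on cube edges -> higher reward
    return -reward * torch.stack(log_probs).sum()          # REINFORCE objective (no baseline)
```

The recurrence lets each rotation be conditioned on what earlier rotations revealed, which is what allows a learned policy to avoid exhaustively evaluating every candidate angle.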
