Browsing by Subject "Video segmentation"
Item: Human machine collaboration for foreground segmentation in images and videos (2017-05)
Jain, Suyog Dutt; Grauman, Kristen Lorraine, 1979-; Mooney, Raymond; Corso, Jason; Niekum, Scott; Vouga, Paul Etienne

Foreground segmentation is the problem of generating pixel-level foreground masks for all the objects in a given image or video. Accurate foreground segmentations have many potential applications, such as improving search, training richer object detectors, image synthesis and retargeting, scene and activity understanding, video summarization, and post-production video editing. One effective way to solve this problem is human-machine collaboration: let humans guide the segmentation process through partial supervision. Humans are extremely good at perception and can easily identify foreground regions. Computers lack this capability, but they excel at continuously processing large volumes of data at the lowest level of detail with great efficiency. Bringing these complementary strengths together can yield systems that are both accurate and cost-effective. In any such human-machine collaboration system, however, cost-effectiveness and accuracy are competing goals. More human involvement certainly leads to higher accuracy, but it also increases cost in both time and money; relying more on machines is cost-effective, but algorithms remain far from human-level performance. Balancing this cost-accuracy trade-off is the key to the success of such a hybrid system.

In this thesis, I develop foreground segmentation algorithms that effectively and efficiently use human guidance to accurately segment foreground objects in images and videos. These algorithms actively reason about the best modalities or interactions through which a user can guide the system toward accurate segmentations. They are also capable of prioritizing human guidance on the instances where it is most needed. Finally, when structural similarity exists within the data (e.g., adjacent frames in a video or similar images in a collection), they propagate information from instances that have received human guidance to those that have not. Together, these characteristics yield substantial savings in human annotation cost while producing high-quality foreground segmentations in images and videos.

I consider three categories of segmentation problems, all of which can greatly benefit from human-machine collaboration. First, I consider interactive image segmentation. In traditional interactive methods, a human annotator provides a coarse spatial annotation (e.g., a bounding box or freehand outline) around the object of interest to obtain a segmentation. The mode of manual annotation affects both accuracy and ease of use. Whereas existing methods assume a fixed form of input no matter the image, I propose a data-driven algorithm that learns whether an interactive segmentation method will succeed if initialized with a given annotation mode. This allows us to predict the modality that will be sufficiently strong to yield a high-quality segmentation for a given image, which results in large savings in annotation cost.
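As a rough illustration of this idea, the sketch below trains one success predictor per annotation modality and picks the cheapest modality whose predicted success probability clears a threshold. The modality names, feature inputs, and use of logistic regression are illustrative assumptions, not the thesis's actual features or models.

```python
# Hypothetical sketch: predict, per annotation modality, whether interactive
# segmentation initialized from it will succeed, then pick the cheapest
# sufficient modality. Names, features, and models are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Modalities ordered cheapest-first by annotation effort.
MODALITIES = ["bounding_box", "freehand_outline", "full_manual_mask"]

def train_success_predictors(features, outcomes):
    """features: (n, d) image descriptors; outcomes: dict mapping each cheap
    modality to (n,) binary labels of past segmentation success/failure."""
    return {m: LogisticRegression(max_iter=1000).fit(features, outcomes[m])
            for m in MODALITIES[:-1]}

def choose_modality(predictors, x, threshold=0.8):
    """Return the cheapest modality predicted to yield a good segmentation."""
    for m in MODALITIES[:-1]:
        p_success = predictors[m].predict_proba(x.reshape(1, -1))[0, 1]
        if p_success >= threshold:
            return m
    return MODALITIES[-1]  # no cheap mode suffices; fall back to manual
```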
I also propose a novel interactive segmentation algorithm called Click Carving, which accurately segments objects in images and videos using a very simple form of human interaction: point clicks. It outperforms several state-of-the-art methods while requiring only a fraction of the human effort.

Second, I consider the problem of segmenting images in a weakly supervised image collection. Here, we are given a collection of images, all belonging to the same object category, and the goal is to jointly segment the common object from all of them. For this, I develop a stagewise active approach to segmentation propagation: in each stage, the images that appear most valuable for human annotation are actively determined and labeled by human annotators, and the foreground estimates in all unlabeled images are then revised accordingly. To identify images that, once annotated, will propagate well to other examples, I introduce an active selection procedure that operates on the joint segmentation graph over all images. It prioritizes human intervention for images that are uncertain and influential in the graph while also being mutually diverse (a simplified sketch of this criterion appears at the end of this abstract). Building on this, I also introduce the problem of measuring compatibility between image pairs for joint segmentation, and I show that restricting joint segmentation to compatible image pairs improves its performance.

Finally, I propose a semi-supervised approach for segmentation propagation in video: given human supervision in some frames, that information is propagated through time. The main challenge is that the foreground object may move quickly through the scene while its appearance and shape evolve over time. To address this, I propose a higher-order supervoxel label consistency potential that leverages bottom-up supervoxels to enforce long-range temporal consistency during propagation. I also introduce the notion of generic pixel-level objectness in images and videos, training a deep neural network that uses appearance and motion to assign each pixel a score capturing its likelihood of being "object" or "background". I show that the human guidance in the semi-supervised propagation algorithm can be further augmented with these generic pixel-objectness scores to obtain even more accurate foreground segmentations in videos.

Throughout, I provide extensive evaluation on challenging datasets and compare against many state-of-the-art methods and other baselines, validating the strengths of the proposed algorithms. Across several experiments, the proposed human-machine collaboration algorithms achieve accurate segmentation of foreground objects in images and videos while saving a large amount of human annotation effort.
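The following is a minimal Python sketch of the active selection criterion mentioned above: a greedy loop that favors images with high uncertainty and influence while penalizing redundancy with already-selected images. The precomputed score arrays, the multiplicative combination, and the lam trade-off are assumptions for illustration; the thesis defines these quantities on the joint segmentation graph itself.

```python
# Illustrative greedy active selection: favor images that are uncertain and
# influential in the joint segmentation graph while staying mutually diverse.
# The score arrays and the lam trade-off are assumed inputs, not the thesis's
# exact formulation.
import numpy as np

def select_for_annotation(uncertainty, influence, similarity, budget, lam=0.5):
    """uncertainty, influence: (n,) arrays; similarity: (n, n) array in [0, 1]."""
    n = len(uncertainty)
    base = uncertainty * influence
    selected = []
    for _ in range(min(budget, n)):
        best_i, best_val = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # Penalize redundancy with images already chosen this round.
            redundancy = max((similarity[i, j] for j in selected), default=0.0)
            val = base[i] - lam * redundancy
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
    return selected
```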
Item: Learning to compose photos and videos from passive cameras (2019-09-16)
Xiong, Bo; Grauman, Kristen Lorraine, 1979-; Hays, James; Huang, Qixing; Niekum, Scott

Photo and video overload is well known to most computer users. With cameras on mobile devices, it is all too easy to snap images and videos spontaneously, yet it remains much harder to organize or search through that content later. With increasingly portable wearable and 360° computing platforms, the overload problem is only intensifying. Wearable and 360° cameras passively record everything they observe, unlike traditional cameras, which require active human attention to capture images or videos.

In my thesis, I explore the idea of automatically composing photos and videos from unedited videos captured by "passive" cameras. Passive cameras (e.g., wearable cameras, 360° cameras) offer a more relaxed recording experience, but they do not always capture frames that look like intentional human-taken photos. In wearable cameras, many frames are blurry, poorly composed, or simply uninteresting. In 360° cameras, a single omnidirectional image captures the entire visual field, and the photographer's intention and attention in that moment are unknown. With this in mind, I consider the following problems in the context of passive cameras: 1) what visual data to capture and store, 2) how to identify foreground objects, and 3) how to enhance the viewing experience.

First, I explore the problem of finding the best moments in unedited videos. Not everything observed in a wearable camera's video stream is worth capturing and storing, and people can easily distinguish well-composed moments from accidental shots. This prompts the question: can a vision system predict the best moments in unedited video? I first study how to find the best moments in the form of short video clips. My key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about content when capturing shorter videos. Leveraging this insight, I introduce a novel ranking framework that learns video highlight detection from unlabeled videos. Next, I show how to predict snap points in unedited video, that is, frames that look like intentionally taken photos. I propose a framework for detecting snap points that requires no human annotations. The main idea is to construct a generative model of what human-taken photos look like by sampling images posted on the Web. Snapshots that people upload to share publicly online may vary vastly in content, yet they all share the key property of having been intentionally taken, which makes them an ideal source of positive exemplars for this learning problem. In both settings, despite learning without any explicit labels, my proposed models outperform discriminative baselines trained with labeled data.

Next, I introduce a novel approach to automatically segment foreground objects in images and videos. Identifying key objects is an important intermediate step for automatic photo composition, and it is a prerequisite for graphics applications such as image retargeting, production video editing, and rotoscoping. Given an image or video frame, the goal is to determine the likelihood that each pixel is part of a foreground object. I formulate the task as a structured prediction problem of assigning an object/background label to each pixel (pixel objectness), and I propose an end-to-end trainable model that draws on the respective strengths of generic object appearance and motion in a unified framework. Since large-scale video datasets with pixel-level segmentations are difficult to obtain, I show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. I also demonstrate how the proposed approach benefits image retrieval and image retargeting. Through experiments on multiple challenging image and video segmentation benchmarks, the method offers consistently strong results and improves the state of the art for fully automatic segmentation of foreground objects.
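A toy sketch of the two-stream pixel objectness idea follows: an appearance stream over RGB and a motion stream over optical flow are fused into a per-pixel object/background score. The layer sizes and fusion scheme are placeholders, not the thesis's architecture, which builds on much deeper segmentation backbones.

```python
# Toy two-stream "pixel objectness" model: an appearance stream (RGB) and a
# motion stream (optical flow) are fused to predict a per-pixel object vs.
# background score. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class PixelObjectness(nn.Module):
    def __init__(self):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            )
        self.appearance = stream(3)       # RGB frame
        self.motion = stream(2)           # optical flow (dx, dy)
        self.fuse = nn.Conv2d(128, 1, 1)  # per-pixel objectness logit

    def forward(self, rgb, flow):
        feats = torch.cat([self.appearance(rgb), self.motion(flow)], dim=1)
        return torch.sigmoid(self.fuse(feats))  # (B, 1, H, W) in [0, 1]
```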
Building on the proposed foreground segmentation method, I finally explore how to predict viewing angles that enhance photo composition once foreground objects have been identified. Specifically, I introduce snap angle prediction for 360° panoramas, a rich medium that is notoriously difficult to visualize in the 2D image plane. I explore how intelligent rotations of a spherical image can enable content-aware projection with fewer perceptible distortions. Whereas existing approaches assume the viewpoint is fixed, intuitively some viewing angles within the sphere preserve high-level objects better than others. To discover the relationship between these optimal snap angles and the spherical panorama's content, I develop a reinforcement learning approach for the cubemap projection model (a simplified sketch of the underlying reward appears at the end of this abstract). Implemented as a deep recurrent neural network, the method selects a sequence of rotation actions and receives a reward for avoiding cube boundaries that overlap important foreground objects. The proposed method offers a 5x speedup over exhaustive search.

Throughout, I validate the strength of the proposed frameworks on multiple challenging datasets against a variety of previously established state-of-the-art methods and other pertinent baselines. The experiments demonstrate the following: 1) the proposed method can automatically identify the best moments in unedited videos; 2) the segmentation method substantially improves the state of the art for foreground segmentation in images and videos and also benefits automatic photo composition; and 3) the viewing-angle prediction for 360° imagery enhances the viewing experience. Although my thesis focuses mainly on passive cameras, a portion of the proposed methods also apply to general user-generated images and videos.
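To make the snap-angle reward concrete, here is a heavily simplified sketch that scores a candidate azimuth rotation by how much foreground mass lands near cubemap face seams. The equirectangular input, the four-vertical-seam approximation, and the scoring function are illustrative assumptions; the thesis learns a rotation policy with a recurrent network rather than scoring angles exhaustively like this.

```python
# Illustrative snap-angle reward: rotate an equirectangular foreground mask by
# a candidate azimuth and penalize foreground pixels that fall near cubemap
# face boundaries. The seam geometry is simplified to vertical seams every
# 90 degrees; the actual cubemap boundary structure is more involved.
import numpy as np

def snap_angle_reward(fg_mask, angle_deg, seam_width=4):
    """fg_mask: (H, W) binary array over an equirectangular panorama."""
    h, w = fg_mask.shape
    shift = int(round(angle_deg / 360.0 * w))
    rotated = np.roll(fg_mask, -shift, axis=1)
    # Approximate the four vertical cube-face seams of a cubemap projection.
    penalty = 0.0
    for k in range(4):
        seam = int(round(k * w / 4.0))
        cols = [(seam + d) % w for d in range(-seam_width, seam_width + 1)]
        penalty += rotated[:, cols].sum()
    # Higher reward when less foreground overlaps the seams.
    return -penalty / max(fg_mask.sum(), 1)
```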