Hybrid representations in 3D vision




Zhang, Zaiwei, Ph.D.


Equipping machines with the ability to understand and process visual content from 3D sensors is important for enabling them to reason about the inherently 3D world we live in. Due to the high cost of 3D scanners, large-scale 3D datasets used to be scarce. Existing data-driven approaches have therefore mainly focused on leveraging 2D images for 3D understanding, while raw 3D scans were used primarily for research in geometric reconstruction. Recently, with cheaper hardware and hence broader availability of consumer-grade 3D cameras (e.g., Microsoft Kinect, Intel RealSense, iPhone 12 Pro Max), several large-scale 3D datasets have been created. These datasets cover a variety of object categories and diverse indoor/outdoor environments, some in the form of raw scans and some as reconstructed 3D meshes, raising unique challenges and opportunities for developing novel data-driven approaches. As a result, data-driven 3D vision has experienced increasing research interest.

In this thesis, we examine a central question in 3D visual learning: data representation. 3D data can be represented in different forms, including point clouds, voxels, and meshes, each with its own representational or computational advantages. Existing work in 3D vision has studied how to leverage each representation separately for various downstream tasks, and the problem of selecting "good" input and output representations for a particular visual learning task has become an important research topic. However, since these representations can be converted into one another, we are not constrained to choose a single one; rather, we should develop algorithms that can leverage different or multiple representations at the same time. In this work, we study the benefits of employing multiple data representations, namely hybrid representations, to solve various 3D vision problems.
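As a concrete illustration of how these representations interconvert (not a method from the thesis), the following minimal NumPy sketch converts a point cloud into a binary voxel occupancy grid; the function name and the choice of resolution are illustrative.

```python
import numpy as np

def voxelize(points: np.ndarray, resolution: int = 32) -> np.ndarray:
    """Convert an (N, 3) point cloud into a binary occupancy voxel grid."""
    # Normalize points into the unit cube [0, 1]^3.
    mins = points.min(axis=0)
    extent = points.max(axis=0) - mins
    scaled = (points - mins) / np.maximum(extent, 1e-8)
    # Map each point to a voxel index; clip so points on the max face stay in range.
    idx = np.clip((scaled * resolution).astype(int), 0, resolution - 1)
    grid = np.zeros((resolution, resolution, resolution), dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Example: voxelize random points sampled on the unit sphere surface.
rng = np.random.default_rng(0)
pts = rng.normal(size=(1000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
grid = voxelize(pts, resolution=16)
print(grid.shape, int(grid.sum()))
```

The reverse direction (voxels back to points, or meshes sampled into point clouds) is similarly mechanical, which is what makes it possible for a learning framework to consume or supervise with several representations of the same underlying geometry.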
We show three major benefits of applying hybrid representations in this thesis: 1) Joint Learning from Multi-representation Supervision; 2) Complementary Feature Learning; 3) Self-supervision Constraints for Unsupervised Learning. We demonstrate learning frameworks for indoor scene modeling, novel view image synthesis, sparse-view 3D reconstruction, 3D object detection, 3D scene segmentation, and self-supervised feature pretraining. In each framework, hybrid representations serve as an essential component and significantly improve performance on each task.

