Representation learning for multi-view 3D understanding

dc.contributor.advisor: Huang, Qixing
dc.contributor.committeeMember: Zhu, Yuke
dc.contributor.committeeMember: Liu, Qiang
dc.contributor.committeeMember: Durrett, Greg
dc.contributor.committeeMember: Anguelov, Dragomir
dc.creator: Yang, Zhenpei
dc.creator.orcid: 0000-0003-2717-5639
dc.date.accessioned: 2022-11-16T00:45:17Z
dc.date.available: 2022-11-16T00:45:17Z
dc.date.created: 2022-08
dc.date.issued: 2022-08-08
dc.date.submitted: August 2022
dc.date.updated: 2022-11-16T00:45:18Z
dc.description.abstract: Sensors record our physical world through 2D projections, e.g., in the form of RGB or RGB-D images. Compared to single-view images, multi-view data offers abundant information and is becoming increasingly accessible due to hardware advances. Developing effective and efficient methods to link and aggregate signals from multiple views is a central step toward 3D vision, and spatial AI in general, with rich downstream applications such as 3D reconstruction and 3D scene understanding. In this dissertation, we study how to design representations of multi-view images for 3D understanding. A preliminary step in processing multi-view images is determining the camera pose of each image, which in turn enables building spatially aware representations from multi-view images. We first study the core component of multi-view pose estimation, i.e., two-view relative pose estimation. Previous approaches usually assume a significant overlap between the two images and fail to handle the small-overlap case, which arises under sudden camera motion or in few-view reconstruction. We show that by learning a complete-scene representation, we can improve relative camera pose estimation under a wide range of overlap conditions. Furthermore, we show considerable improvement on top of this framework by learning a hybrid scene-completion model and adopting a global-to-local prediction procedure. The second major problem studied in this dissertation is building efficient multi-view representations from registered images. We first propose a 2D representation that encodes multi-view features efficiently in the local camera frame. Such a representation can be easily embedded into existing 2D convolutional neural networks and was demonstrated to be a fast alternative to 3D cost volumes for accurate per-view depth estimation. We also propose a method to learn object representations for fast 3D reconstruction from a few images. We show how such a reconstruction system can tolerate noisy camera poses by jointly optimizing 3D representations and 2D feature alignment. We also discuss how geometric estimation from multi-view images can benefit semantic-level inference tasks, such as multi-view 3D object detection. Finally, we study how to build a detail-preserving representation from LiDAR and multi-view images in autonomous driving scenarios. Such a representation can be used to synthesize novel traversals of any visited scene, enabling photorealistic simulation testing.
dc.description.department: Computer Science
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/2152/116705
dc.identifier.uri: http://dx.doi.org/10.26153/tsw/43600
dc.language.iso: en
dc.subject: Multi-view
dc.subject: 3D vision
dc.subject: Computer vision
dc.title: Representation learning for multi-view 3D understanding
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Computer Sciences
thesis.degree.discipline: Computer Science
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy

Original bundle

Name: YANG-DISSERTATION-2022.pdf
Size: 73.2 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.45 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.84 KB
Format: Plain Text