Representation learning for multi-view 3D understanding
dc.contributor.advisor | Huang, Qixing | |
dc.contributor.committeeMember | Zhu, Yuke | |
dc.contributor.committeeMember | Liu, Qiang | |
dc.contributor.committeeMember | Durrett, Greg | |
dc.contributor.committeeMember | Anguelov, Dragomir | |
dc.creator | Yang, Zhenpei | |
dc.creator.orcid | 0000-0003-2717-5639 | |
dc.date.accessioned | 2022-11-16T00:45:17Z | |
dc.date.available | 2022-11-16T00:45:17Z | |
dc.date.created | 2022-08 | |
dc.date.issued | 2022-08-08 | |
dc.date.submitted | August 2022 | |
dc.date.updated | 2022-11-16T00:45:18Z | |
dc.description.abstract | Sensors record our physical world through 2D projections, e.g., in the form of RGB or RGB-D images. Compared to a single-view image, multi-view data offers far richer information and is becoming increasingly accessible thanks to hardware advances. Developing effective and efficient methods to link and aggregate signals from multiple views is a central step towards 3D vision and spatial AI in general, with rich downstream applications such as 3D reconstruction and 3D scene understanding. In this dissertation, we study how to design representations of multi-view images for 3D understanding. A preliminary step in processing multi-view images is determining the camera pose of each image, which in turn enables building spatially aware representations from multi-view images. We first study the core component of multi-view pose estimation, i.e., two-view relative pose estimation. Previous approaches usually assume significant overlap between the two images and fail when the overlap is small, as happens under sudden camera motion or in few-view reconstruction. We show that by learning a complete-scene representation, we can improve relative camera pose estimation under a wide range of overlap conditions. Furthermore, we improve considerably on this framework by learning a hybrid scene-completion model and adopting a global-to-local prediction procedure. The second major problem studied in this dissertation is building efficient multi-view representations from registered images. We first propose a 2D representation that encodes multi-view features efficiently in the local camera frame. Such a representation can be easily embedded into existing 2D convolutional neural networks, and we demonstrate that it is a fast alternative to 3D cost volumes for accurate per-view depth estimation. We also propose a method to learn object representations for fast 3D reconstruction from a few images, and show how such a reconstruction system can tolerate noisy camera poses by jointly optimizing 3D representations and 2D feature alignment. We further discuss how geometric estimation from multi-view images can benefit semantic-level inference tasks such as multi-view 3D object detection. Finally, we study how to build a detail-preserving representation from LiDAR and multi-view images in autonomous driving scenarios. Such a representation can be used to synthesize novel traversals of any visited scene, enabling photorealistic simulation for testing. | |
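Note: the classical baseline for two-view relative pose estimation, which the abstract contrasts with in the small-overlap regime, matches keypoints across the image pair and decomposes an essential matrix. A minimal sketch using OpenCV follows; it is illustrative only and is not the dissertation's learned scene-completion method (the intrinsics matrix K and the choice of ORB features are assumptions of this sketch).

    import cv2
    import numpy as np

    def two_view_relative_pose(img1, img2, K):
        # Detect and describe keypoints in both grayscale images
        # (ORB is a common, license-free choice).
        orb = cv2.ORB_create(nfeatures=4000)
        kp1, des1 = orb.detectAndCompute(img1, None)
        kp2, des2 = orb.detectAndCompute(img2, None)

        # Brute-force matching with Hamming distance
        # (ORB descriptors are binary strings).
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

        # Robustly fit the essential matrix with RANSAC, then recover
        # (R, t) via the cheirality check; translation is only
        # recoverable up to scale from two views.
        E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                       prob=0.999, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
        return R, t

Usage: R, t = two_view_relative_pose(cv2.imread("a.png", 0), cv2.imread("b.png", 0), K), with K the 3x3 camera intrinsics. Because this pipeline relies on keypoint correspondences inside the shared field of view, it breaks down as overlap shrinks, which is precisely the failure mode the learned complete-scene representation targets.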
dc.description.department | Computer Science | |
dc.format.mimetype | application/pdf | |
dc.identifier.uri | https://hdl.handle.net/2152/116705 | |
dc.identifier.uri | http://dx.doi.org/10.26153/tsw/43600 | |
dc.language.iso | en | |
dc.subject | Multi-view | |
dc.subject | 3D vision | |
dc.subject | Computer vision | |
dc.title | Representation learning for multi-view 3D understanding | |
dc.type | Thesis | |
dc.type.material | text | |
thesis.degree.department | Computer Sciences | |
thesis.degree.discipline | Computer Science | |
thesis.degree.grantor | The University of Texas at Austin | |
thesis.degree.level | Doctoral | |
thesis.degree.name | Doctor of Philosophy |