dc.contributor.advisor: Krähenbühl, Philipp
dc.creator: Zhou, Xingyi, 1994-
dc.date.accessioned: 2022-08-05T21:34:24Z
dc.date.available: 2022-08-05T21:34:24Z
dc.date.created: 2022-05
dc.date.issued: 2022-05-02
dc.date.submitted: May 2022
dc.identifier.uri: https://hdl.handle.net/2152/115145
dc.identifier.uri: http://dx.doi.org/10.26153/tsw/42046
dc.description.abstract: Large-scale, well-curated datasets are the fuel of computer vision. However, most datasets focus on a single domain with a specific task and a fixed label set. Computer vision models trained on a single dataset apply to only a subset of real-world applications. The goal of my research is to remove the artificial barriers between datasets and make object recognition generalize in the wild. There should be one single computer vision model, not a zoo of dataset-specific models. The model should be trained on a diverse set of datasets and should be able to recognize objects from different data sources in all domains. Towards this goal, my thesis focuses on three aspects: a point-based object representation that unifies multiple vision tasks, a unified framework that detects and tracks objects through time, and a unified vocabulary between detection and classification annotations. First, we propose to represent an object using the simplest possible representation: a point. All object properties, such as size, pose, depth, and velocity, are attributes of the point and are inferred from the point features. We develop a point-based object detector, CenterNet, using standard keypoint detection techniques, and extend it to many vision tasks by simply adding task-specific regression outputs. The point-based representation achieves state-of-the-art performance and runs fast within a unified framework. Second, we show that the point-based representation also simplifies linking objects through time. We extend our point-based detector into a local tracker by regressing the inter-frame motion of each object. The resulting point-based tracker is efficient, accurate, robust, and unified across different domains, tasks, and framerates. Going further, we develop a tracker that associates and classifies objects over a whole video clip: the global association uses a transformer that attends to all objects in a long temporal window and directly produces trajectories. Finally, we study how to extend the vocabulary of our recognition system. We explore two directions: 1) merging multiple object detection datasets with different vocabularies and domains using an automatic label-space unification algorithm; 2) introducing additional classification annotations with a much larger vocabulary, i.e., twenty thousand classes. The resulting unified detector has a broad vocabulary, is more robust to changes in the visual domain, and generalizes readily to new, unseen environments and taxonomies.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Object detection
dc.subject: Tracking
dc.subject: Open-vocabulary
dc.title: Towards unified object recognition in the wild
dc.type: Thesis
dc.date.updated: 2022-08-05T21:34:25Z
dc.contributor.committeeMember: Mooney, Raymond J.
dc.contributor.committeeMember: Zhu, Yuke
dc.contributor.committeeMember: Ramanan, Deva
dc.description.department: Computer Sciences
thesis.degree.department: Computer Sciences
thesis.degree.discipline: Computer Science
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
dc.creator.orcid: 0000-0002-0914-8525
dc.type.material: text
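
The point-based decoding and inter-frame linking summarized in the abstract above can be illustrated with a short sketch. The following is a minimal NumPy sketch under assumed tensor shapes; the function names (decode_points, link_tracks) and the greedy matching are illustrative assumptions, not the thesis implementation (CenterNet, for instance, additionally suppresses non-peak heatmap locations and refines centers with a sub-pixel offset head).

    import numpy as np

    def decode_points(heatmap, size_map, k=100):
        # Pick the k strongest center peaks of a (h, w) heatmap and read
        # per-point attributes (here just width/height) from a dense
        # regression map of assumed shape (2, h, w) at those points.
        h, w = heatmap.shape
        top = np.argsort(heatmap.ravel())[::-1][:k]
        ys, xs = np.unravel_index(top, (h, w))
        boxes = []
        for y, x in zip(ys, xs):
            bw, bh = size_map[:, y, x]
            boxes.append((x - bw / 2, y - bh / 2,
                          x + bw / 2, y + bh / 2, heatmap[y, x]))
        return boxes

    def link_tracks(prev_centers, curr_centers, offsets, radius=30.0):
        # Shift each current center by its regressed inter-frame offset and
        # greedily match it to the nearest unclaimed previous-frame center
        # within `radius` pixels. Returns (current_idx, previous_idx) pairs.
        matches, used = [], set()
        for i, (c, off) in enumerate(zip(curr_centers, offsets)):
            pred = (c[0] + off[0], c[1] + off[1])
            best, best_d = None, radius
            for j, p in enumerate(prev_centers):
                d = ((pred[0] - p[0]) ** 2 + (pred[1] - p[1]) ** 2) ** 0.5
                if j not in used and d < best_d:
                    best, best_d = j, d
            if best is not None:
                used.add(best)
                matches.append((i, best))
        return matches

Unmatched current detections would start new tracks. Per the abstract, the longer-horizon variant replaces this greedy frame-to-frame matching with a transformer that attends to all objects in a long temporal window and produces trajectories directly.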

