Data- and compute-efficient visual recognition and generation






The remarkable advancements in deep learning for visual recognition and generation have often come with a significant computational burden. As model complexity escalates, efficiency in both architecture construction and data utilization becomes paramount. This dissertation examines two fundamental categories of efficiency: model efficiency and data efficiency.

1. Model Efficiency: This part of the study focuses on reducing the computational cost of deep neural networks without compromising performance. Through neural architecture search (NAS), we discover highly efficient models tailored for video action recognition. Our novel multi-stream, multivariate search space leads to two-stream models such as Auto-TSNet that dramatically reduce FLOPs while improving accuracy on standard benchmarks.

2. Data Efficiency: Data efficiency concerns a model's capacity to learn effectively from limited data, a characteristic that is particularly valuable when gathering or labeling extensive data is prohibitively expensive or infeasible. The dissertation focuses on data efficiency in the downstream-task generalization of pre-trained models, given their central role in advancing the field. We address two main challenges in this domain:

(a) Incremental Few-shot Learning (IFL): IFL is a nuanced challenge that requires a model to learn new categories from only a few examples without forgetting previously learned information. We investigate IFL in two essential domains: object detection and image generation. For object detection, we introduce a weakly supervised approach, WS-iFSD, that substantially augments meta-training and outperforms existing methods on key benchmarks.
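The FLOPs-constrained architecture search described under item 1 can be illustrated with a minimal sketch. This is not the dissertation's actual method or search space: the variable names, the toy cost model, and the use of plain random search (rather than a learned NAS strategy) are all illustrative assumptions. It only shows the general shape of searching a multivariate space per stream under a compute budget.

```python
import random

# Hypothetical toy search space (an assumption, not Auto-TSNet's): each of the
# two streams independently picks a depth, width, and temporal kernel size.
SEARCH_SPACE = {
    "depth": [8, 12, 16],
    "width": [32, 48, 64],
    "temporal_kernel": [1, 3, 5],
}


def sample_architecture():
    """Sample one candidate: an independent choice per variable, per stream."""
    return {
        stream: {var: random.choice(opts) for var, opts in SEARCH_SPACE.items()}
        for stream in ("rgb_stream", "motion_stream")
    }


def estimate_flops(arch):
    """Crude stand-in cost model: cost grows with depth * width^2 * kernel."""
    return sum(
        cfg["depth"] * cfg["width"] ** 2 * cfg["temporal_kernel"]
        for cfg in arch.values()
    )


def random_search(n_samples, flops_budget, evaluate):
    """Keep the best-scoring sampled architecture that fits the FLOPs budget."""
    best_arch, best_score = None, float("-inf")
    for _ in range(n_samples):
        arch = sample_architecture()
        if estimate_flops(arch) > flops_budget:
            continue  # discard over-budget candidates
        score = evaluate(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score
```

In practice `evaluate` would be a (possibly proxied) accuracy measurement; here any scoring callable works, and the budget check is what enforces the efficiency constraint.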
For image generation, we propose EI-GAN, an efficient generative model that incrementally registers new categories without revisiting previous data or suffering catastrophic forgetting. Together, these methods demonstrate significant advances in the ability to learn and generalize from limited data.

(b) Multimodal Generalization (MMG): MMG, a novel focus of this dissertation, addresses how systems adapt when certain modalities are limited or absent. We introduce two evaluation settings: 1) missing-modality evaluation, which tests a system's ability to function when some of the modalities present during training are unavailable at inference, and 2) cross-modal zero-shot evaluation, which measures performance when the training and inference modalities are entirely disjoint. Our exploration of these challenges, together with new models and a new dataset, MMG-Ego4D, emphasizes the efficiency of generalization and contributes vital insights to multimodal learning and adaptation.

The intertwined exploration of model and data efficiency contributes new methodologies and builds a deeper understanding of efficiency in deep learning. By bridging the gap between high performance and computational frugality, this dissertation paves the way for more sustainable and adaptable deep learning applications in visual recognition and generation.
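The missing-modality setting in (b) can be sketched in a few lines. This is an illustrative assumption, not the MMG-Ego4D protocol or API: it posits a simple late-fusion classifier that averages per-modality logits, so evaluating with a missing modality amounts to dropping that modality's logits before fusion.

```python
import numpy as np

# Assumed setup: each batch supplies per-modality class logits as arrays of
# shape (batch, num_classes); fusion averages whichever modalities remain.


def fuse_predict(modality_logits):
    """Average the logits of the available modalities, then take the argmax."""
    present = [logits for logits in modality_logits.values() if logits is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    return np.mean(present, axis=0).argmax(axis=-1)


def evaluate_missing_modality(batches, labels, drop=()):
    """Accuracy when the modalities named in `drop` are absent at test time."""
    correct = total = 0
    for logits_by_mod, y in zip(batches, labels):
        available = {m: (None if m in drop else v) for m, v in logits_by_mod.items()}
        preds = fuse_predict(available)
        correct += int((preds == y).sum())
        total += y.size
    return correct / total
```

Running the same evaluator once with `drop=()` and once with, say, `drop=("audio",)` gives the performance gap a system suffers when a training-time modality disappears, which is exactly what missing-modality evaluation measures.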

