Optimizing visual grounding of latent representations of speech from distant language groups

Date

2021-12-03

Authors

Crabtree, Christopher Edwin

Abstract

Recent years have seen growing research interest in using multi-modal grounding techniques to bolster classic natural language processing (NLP) and automatic speech recognition (ASR) tasks. Previous work by Harwath et al. [5] demonstrated that visual grounding approximately doubled their model's bilingual utterance retrieval performance, and that image retrieval was likewise substantially improved by adding an alignment objective between languages. However, much remains unknown about the exact mechanism by which grounding is exploited in modern neural network systems. In this work, we extend the line of research pioneered by Harwath et al. by empirically exploring several contrastive learning frameworks and objectives designed to align input from different modalities (i.e., visual and speech input). Through analysis of our top two performing loss functions, our experiments indicate potential avenues for improvement over the current best-performing loss objective. We also find that in our trilingual setting, cross-lingual learning objectives can be removed to both improve image retrieval performance and reduce hyperparameter complexity.
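For readers unfamiliar with the kind of cross-modal contrastive objective the abstract refers to, the following is a minimal PyTorch sketch of a symmetric InfoNCE-style loss that aligns paired image and speech embeddings within a batch. It is an illustrative assumption of the general technique, not the thesis's actual loss function; the function name, shapes, and temperature value are hypothetical.

import torch
import torch.nn.functional as F

def cross_modal_infonce(image_emb: torch.Tensor,
                        speech_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, speech_emb: (batch, dim) tensors where row i of each
    is a matched image/utterance pair; all other rows in the batch
    serve as negatives. Illustrative sketch only.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds positives.
    logits = image_emb @ speech_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-speech and speech-to-image retrieval losses.
    loss_i2s = F.cross_entropy(logits, targets)
    loss_s2i = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_i2s + loss_s2i)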
