Natural-language video description with deep recurrent neural networks

dc.contributor.advisor: Mooney, Raymond J. (Raymond Joseph)
dc.contributor.committeeMember: Grauman, Kristen
dc.contributor.committeeMember: Stone, Peter
dc.contributor.committeeMember: Saenko, Kate
dc.contributor.committeeMember: Darrell, Trevor
dc.creator: Venugopalan, Subhashini
dc.creator.orcid: 0000-0003-3729-8456
dc.date.accessioned: 2017-12-13T15:43:14Z
dc.date.available: 2017-12-13T15:43:14Z
dc.date.created: 2017-08
dc.date.issued: 2017-08
dc.date.submitted: August 2017
dc.date.updated: 2017-12-13T15:43:14Z
dc.description.abstract: For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting meaning from video pixels and generating a sentence description is a very complex problem. The goal of this thesis is to develop models that can automatically generate natural language descriptions for events in videos. It presents several approaches to automatic video description by building on recent advances in “deep” machine learning. The techniques presented in this thesis view the task of video description as akin to machine translation, treating the video domain as a source “language” and using deep neural network architectures to “translate” videos to text. Specifically, I develop video captioning techniques using a unified deep neural network with both convolutional and recurrent structure, modeling the temporal elements in videos and language with deep recurrent neural networks. In my initial approach, I adapt a model that can learn from paired images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. To further improve grammaticality and descriptive quality, I also propose methods to integrate linguistic knowledge from plain text corpora. Additionally, I show that such linguistic knowledge can help describe novel objects unseen in paired image/video-caption data. Finally, moving beyond short video clips, I present methods to process longer multi-activity videos, specifically to jointly segment and describe coherent event sequences in movies.
dc.description.department: Computer Science
dc.format.mimetype: application/pdf
dc.identifier: doi:10.15781/T2QR4P68H
dc.identifier.uri: http://hdl.handle.net/2152/62987
dc.language.iso: en
dc.subject: Video
dc.subject: Captioning
dc.subject: Description
dc.subject: LSTM
dc.subject: RNN
dc.subject: Recurrent
dc.subject: Neural networks
dc.subject: Image captioning
dc.subject: Video captioning
dc.subject: Language and vision
dc.title: Natural-language video description with deep recurrent neural networks
dc.type: Thesis
dc.type.material: text
thesis.degree.department: Computer Sciences
thesis.degree.discipline: Computer Science
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
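
The abstract above outlines a sequence-to-sequence approach in which a recurrent network reads per-frame convolutional features and then emits a word sequence. Below is a minimal sketch of that idea in PyTorch; the class and attribute names (VideoCaptioner, frame_proj) are hypothetical, and a single shared LSTM stands in for the thesis's deeper recurrent stack, so this is an illustration under those assumptions rather than the dissertation's exact architecture.

# Minimal sketch (PyTorch, hypothetical names) of a sequence-to-sequence
# video captioner: an LSTM encodes a sequence of per-frame CNN features,
# then decodes a word sequence conditioned on that encoding.
# Frame feature extraction, vocabulary handling, and training are omitted.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=512, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, embed_dim)   # project CNN frame features
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, num_words)
        enc_in = self.frame_proj(frame_feats)
        _, state = self.lstm(enc_in)            # encode the frame sequence
        dec_in = self.word_embed(captions)
        dec_out, _ = self.lstm(dec_in, state)   # decode words from the video state
        return self.out(dec_out)                # per-step vocabulary logits


if __name__ == "__main__":
    model = VideoCaptioner()
    feats = torch.randn(2, 30, 4096)            # 2 clips, 30 frames of CNN features
    caps = torch.randint(0, 10000, (2, 12))     # 2 captions, 12 word ids each
    print(model(feats, caps).shape)             # torch.Size([2, 12, 10000])

At inference time the decoder would instead be run one word at a time, feeding each predicted word back in until an end-of-sentence token is produced.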

Access full-text files

Original bundle

Name: VENUGOPALAN-DISSERTATION-2017.pdf
Size: 16.7 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.46 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.85 KB
Format: Plain Text