dc.contributor.advisor: Baldridge, Jason
dc.contributor.advisor: Mooney, Raymond J. (Raymond Joseph)
dc.creator: Garrette, Daniel Hunter
dc.date.accessioned: 2017-01-20T16:21:42Z
dc.date.available: 2017-01-20T16:21:42Z
dc.date.issued: 2015-05
dc.date.submitted: May 2015
dc.identifier: doi:10.15781/T2WM13X40
dc.identifier.uri: http://hdl.handle.net/2152/44478
dc.description.abstract: The best-performing NLP models to date are learned from large volumes of manually annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains. This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input, even an amount that can be collected in just a few hours, can provide enormous advantages if we have learning algorithms that can appropriately exploit it. This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation. Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Computer science
dc.subject: Artificial intelligence
dc.subject: Natural language processing
dc.subject: Machine learning
dc.subject: Bayesian statistics
dc.subject: Grammar induction
dc.subject: Parsing
dc.subject: Computational linguistics
dc.title: Inducing grammars from linguistic universals and realistic amounts of supervision
dc.type: Thesis
dc.date.updated: 2017-01-20T16:21:42Z
dc.contributor.committeeMember: Ravikumar, Pradeep
dc.contributor.committeeMember: Scott, James G
dc.contributor.committeeMember: Smith, Noah A
dc.description.department: Computer Sciences
thesis.degree.department: Computer Sciences
thesis.degree.discipline: Artificial intelligence
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
dc.type.material: text
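The abstract above mentions combining fast-to-obtain type-level supervision, such as tag dictionaries, with unannotated text in place of token-level annotation. The sketch below is purely illustrative and is not taken from the dissertation: it shows, under toy assumptions, how a type-level tag dictionary can constrain a simple bigram HMM part-of-speech tagger trained with hard (Viterbi) EM on otherwise unannotated sentences. All words, tags, sentences, and the choice of hard EM are hypothetical and exist only to make the type-level-supervision idea concrete.

```python
# Illustrative sketch (not from the dissertation): a bigram HMM POS tagger
# whose only supervision is a type-level tag dictionary, trained with hard
# (Viterbi) EM on unannotated sentences. All data below is toy/hypothetical.
import math

# Type-level supervision: each word type maps to its set of allowed tags.
tag_dict = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "saw": {"NOUN", "VERB"},
    "runs": {"NOUN", "VERB"},
    "fast": {"ADV", "ADJ"},
}
tags = sorted({t for ts in tag_dict.values() for t in ts})

# Unannotated corpus: token sequences with no tag annotation at all.
corpus = [
    ["the", "dog", "runs", "fast"],
    ["the", "saw", "runs"],
    ["the", "dog", "saw", "the", "dog"],
]
vocab = {w for sent in corpus for w in sent}

def allowed(word):
    # The dictionary prunes the label space; unknown words keep all tags.
    return tag_dict.get(word, set(tags))

def estimate(tagged):
    # Relative-frequency transition/emission estimates with add-one smoothing.
    trans = {p: {t: 1.0 for t in tags} for p in tags}
    emit = {t: {w: 1.0 for w in vocab} for t in tags}
    for sent, seq in tagged:
        for i, (w, t) in enumerate(zip(sent, seq)):
            emit[t][w] += 1.0
            if i > 0:
                trans[seq[i - 1]][t] += 1.0
    trans = {p: {t: c / sum(row.values()) for t, c in row.items()}
             for p, row in trans.items()}
    emit = {t: {w: c / sum(row.values()) for w, c in row.items()}
            for t, row in emit.items()}
    return trans, emit

def viterbi(sent, trans, emit):
    # Best tag sequence under current parameters, restricted so that every
    # word receives only tags its dictionary entry allows.
    chart = [{t: (math.log(emit[t][sent[0]]), [t]) for t in allowed(sent[0])}]
    for i in range(1, len(sent)):
        col = {}
        for t in allowed(sent[i]):
            col[t] = max(
                (prev_score + math.log(trans[p][t]) + math.log(emit[t][sent[i]]),
                 prev_path + [t])
                for p, (prev_score, prev_path) in chart[-1].items()
            )
        chart.append(col)
    return max(chart[-1].values())[1]

# Start from uniform parameters, then alternate decoding and re-estimation.
trans, emit = estimate([])
for _ in range(5):
    decoded = [(sent, viterbi(sent, trans, emit)) for sent in corpus]
    trans, emit = estimate(decoded)

for sent in corpus:
    print(list(zip(sent, viterbi(sent, trans, emit))))
```

Hard EM is used here only to keep the example short; the dissertation's own models are richer (for example, Bayesian priors encoding CCG's linguistic universals), and this sketch should not be read as its method.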

