Show simple item record

dc.contributor.advisorMooney, Raymond J. (Raymond Joseph)en
dc.creatorBunescu, Razvan Constantin, 1975-en
dc.date.accessioned2008-08-28T23:33:22Zen
dc.date.available2008-08-28T23:33:22Zen
dc.date.issued2007-08en
dc.identifierb68792815en
dc.identifier.urihttp://hdl.handle.net/2152/3200en
dc.descriptiontexten
dc.description.abstractInformation Extraction, the task of locating textual mentions of specific types of entities and their relationships, aims at representing the information contained in text documents in a structured format that is more amenable to applications in data mining, question answering, or the semantic web. The goal of our research is to design information extraction models that obtain improved performance by exploiting types of evidence that have not been explored in previous approaches. Since designing an extraction system through introspection by a domain expert is a laborious and time consuming process, the focus of this thesis will be on methods that automatically induce an extraction model by training on a dataset of manually labeled examples. Named Entity Recognition is an information extraction task that is concerned with finding textual mentions of entities that belong to a predefined set of categories. We approach this task as a phrase classification problem, in which candidate phrases from the same document are collectively classified. Global correlations between candidate entities are captured in a model built using the expressive framework of Relational Markov Networks. Additionally, we propose a novel tractable approach to phrase classification for named entity recognition based on a special Junction Tree representation. Classifying entity mentions into a predefined set of categories achieves only a partial disambiguation of the names. This is further refined in the task of Named Entity Disambiguation, where names need to be linked to their actual denotations. In our research, we use Wikipedia as a repository of named entities and propose a ranking approach to disambiguation that exploits learned correlations between words from the name context and categories from the Wikipedia taxonomy. Relation Extraction refers to finding relevant relationships between entities mentioned in text documents. Our approaches to this information extraction task differ in the type and the amount of supervision required. We first propose two relation extraction methods that are trained on documents in which sentences are manually annotated for the required relationships. In the first method, the extraction patterns correspond to sequences of words and word classes anchored at two entity names occurring in the same sentence. These are used as implicit features in a generalized subsequence kernel, with weights computed through training of Support Vector Machines. In the second approach, the implicit extraction features are focused on the shortest path between the two entities in the word-word dependency graph of the sentence. Finally, in a significant departure from previous learning approaches to relation extraction, we propose reducing the amount of required supervision to only a handful of pairs of entities known to exhibit or not exhibit the desired relationship. Each pair is associated with a bag of sentences extracted automatically from a very large corpus. We extend the subsequence kernel to handle this weaker form of supervision, and describe a method for weighting features in order to focus on those correlated with the target relation rather than with the individual entities. The resulting Multiple Instance Learning approach offers a competitive alternative to previous relation extraction methods, at a significantly reduced cost in human supervision.en
dc.format.mediumelectronicen
dc.language.isoengen
dc.rightsCopyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author who has retained all copyrights in the works.en
dc.subject.lcshNatural language processing (Computer science)en
dc.subject.lcshInformation storage and retrieval systemsen
dc.titleLearning for information extraction: from named entity recognition and disambiguation to relation extractionen
dc.description.departmentComputer Sciencesen
dc.identifier.oclc173649973en
dc.type.genreThesisen
thesis.degree.departmentComputer Sciencesen
thesis.degree.disciplineComputer Sciencesen
thesis.degree.grantorThe University of Texas at Austinen
thesis.degree.levelDoctoralen
thesis.degree.nameDoctor of Philosophyen


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record