dc.contributor.advisor: Baldridge, Jason
dc.creator: Wing, Benjamin Patai
dc.date.accessioned: 2016-09-14T17:37:45Z
dc.date.available: 2016-09-14T17:37:45Z
dc.date.issued: 2015-12
dc.date.submitted: December 2015
dc.identifier: doi:10.15781/T24T6F41H
dc.identifier.uri: http://hdl.handle.net/2152/40313
dc.description.abstract: This dissertation investigates automatic geolocation of documents (i.e. identification of their location, expressed as latitude/longitude coordinates), based on the text of those documents rather than metadata. I assert that such geolocation can be performed using text alone, at a sufficient accuracy for use in real-world applications. Although in some corpora metadata is found in abundance (e.g. home location, time zone, friends, followers, etc. in Twitter), it is lacking in others, such as many corpora of primary-source documents in the digital humanities, an area to which document geolocation has hardly been applied. To this end, I first develop methods for accurate text-based geolocation and then apply them to newly-annotated corpora in the digital humanities. The geolocation methods I develop use both uniform and adaptive (k-d tree) grids over the Earth’s surface, culminating in a hierarchical logistic-regression-based technique that achieves state-of-the-art results on well-known corpora (Twitter user feeds, Wikipedia articles and Flickr image tags). In the second part of the dissertation I develop a new NLP task, text-based geolocation of historical corpora. Because there are no existing corpora to test on, I create and annotate two new corpora of significantly different natures (a 19th-century travel log and a large set of Civil War archives). I show how my methods produce good geolocation accuracy even given the relatively small amount of annotated data available, which can be further improved using domain adaptation. I then use the predictions on the much larger unannotated portion of the Civil War archives to generate and analyze geographic topic models, showing how they can be mined to produce interesting revelations concerning various Civil War-related subjects.
Finally, I develop a new geolocation technique for text-only corpora involving co-training between document-geolocation and toponym-resolution models, using a gazetteer to inject additional information into the training process. To evaluate this technique I develop a new metric, the closest toponym error distance, on which I show improvements compared with a baseline geolocator.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Geolocation
dc.subject: Computational linguistics
dc.subject: Natural language processing
dc.subject: Digital humanities
dc.title: Text-based document geolocation and its application to the digital humanities
dc.type: Thesis
dc.date.updated: 2016-09-14T17:37:45Z
dc.contributor.committeeMember: Erk, Katrin
dc.contributor.committeeMember: Beaver, David
dc.contributor.committeeMember: Mooney, Ray
dc.contributor.committeeMember: Lease, Matt
dc.description.department: Linguistics
thesis.degree.department: Linguistics
thesis.degree.discipline: Linguistics
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy
dc.creator.orcid: 0000-0001-9911-0186
dc.type.material: text