Browsing by Subject "Geolocation"
Item: Data and methods for Gazetteer Independent Toponym Resolution (2016-05)
DeLozier, Grant Hollis; Baldridge, Jason; Erk, Katrin

This thesis looks at the computational task of Toponym Resolution from multiple perspectives. In its common form the task requires transforming a place name--e.g. Washington--into some grounded representation of that place, typically a point (latitude, longitude) geometry. In recent years Toponym Resolution (TR) systems have advanced beyond heuristic techniques into more complex machine learned classifiers and impressive gains have been made. Despite these advances, a number of issues remain with the task. This thesis looks at aspects of typical TR approaches in a critical light and proposes solutions and new methods. In particular, I'm critical of the dependence of existing approaches on gazetteer matching and under-utilization of complex geometric data types. I also outline some of the shortcomings in existing toponym corpora and detail a new corpus and annotation tool which I helped to develop.

In earlier work I explored whether TR systems could be built without dependencies on gazetteer lookups. That work, which I expand and review in this thesis, showed that competitive accuracies can be achieved without using these human curated resources. Additionally, I demonstrate through error analysis that the largest advantage of a gazetteer matching component is with ontology correction and matching, and not with disambiguation or grounding.

These new approaches are tested on pre-existing TR corpora, as well as a new corpus in a novel domain. In the process of detailing the new corpus, I remark on many challenges and design decisions that must be made in Toponym Resolution and propose a new evaluation metric.

Item: Data-rich document geotagging using geodesic grids (2011-05)
Wing, Benjamin Patai; Baldridge, Jason; Erk, Katrin

This thesis investigates automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

Item: Document geolocation using language models built from lexical and geographic similarity (2012-05)
Skiles, Erik David; Baldridge, Jason; Erk, Katrin

This thesis investigates the automatic identification of the location of documents. This process of geolocation aids in toponym resolution, document summarization, and geographic-based marketing. I focus on minimally supervised methods to examine both the lexical similarities and the geographic similarities between documents. This method predicts the location of a document as a single point on the earth’s surface. Three data sets are used to evaluate this method: a set of geotagged Wikipedia articles and two sets of Twitter feeds. For Wikipedia, the combined method obtains a median error of 12.1 kilometers and an improvement in mean error to 164 kilometers. The large Twitter data shows the greatest improvement from this method with a median error of 333 kilometers, down from the previous best of 463 kilometers.

Item: Text-based document geolocation and its application to the digital humanities (2015-12)
Wing, Benjamin Patai; Baldridge, Jason; Erk, Katrin; Beaver, David; Mooney, Ray; Lease, Matt

This dissertation investigates automatic geolocation of documents (i.e. identification of their location, expressed as latitude/longitude coordinates), based on the text of those documents rather than metadata. I assert that such geolocation can be performed using text alone, at a sufficient accuracy for use in real-world applications. Although in some corpora metadata is found in abundance (e.g. home location, time zone, friends, followers, etc. in Twitter), it is lacking in others, such as many corpora of primary-source documents in the digital humanities, an area to which document geolocation has hardly been applied. To this end, I first develop methods for accurate text-based geolocation and then apply them to newly-annotated corpora in the digital humanities. The geolocation methods I develop use both uniform and adaptive (k-d tree) grids over the Earth’s surface, culminating in a hierarchical logistic-regression-based technique that achieves state of the art results on well-known corpora (Twitter user feeds, Wikipedia articles and Flickr image tags).

In the second part of the dissertation I develop a new NLP task, text-based geolocation of historical corpora. Because there are no existing corpora to test on, I create and annotate two new corpora of significantly different natures (a 19th-century travel log and a large set of Civil War archives). I show how my methods produce good geolocation accuracy even given the relatively small amount of annotated data available, which can be further improved using domain adaptation. I then use the predictions on the much larger unannotated portion of the Civil War archives to generate and analyze geographic topic models, showing how they can be mined to produce interesting revelations concerning various Civil War-related subjects.

Finally, I develop a new geolocation technique for text-only corpora involving co-training between document-geolocation and toponym-resolution models, using a gazetteer to inject additional information into the training process. To evaluate this technique I develop a new metric, the closest toponym error distance, on which I show improvements compared with a baseline geolocator.