Methods and applications of text-driven toponym resolution with indirect supervision
MetadataShow full item record
This thesis addresses the problem of toponym resolution. Given an ambiguous placename like Springfield in some natural language context, the task is to automatically predict the location on the earth's surface the author is referring to. Many previous efforts use hand-built heuristics to attempt to solve this problem, looking for specific words in close proximity such as Springfield, Illinois, and disambiguating any remaining toponyms to possible locations close to those already resolved. Such approaches require the data to take a fairly specific form in order to perform well, thus they often have low coverage. Some have applied machine learning to this task in an attempt to build more general resolvers, but acquiring large amounts of high quality hand-labeled training material is difficult. I discuss these and other approaches found in previous work before presenting several new toponym resolvers that rely neither on hand-labeled training material prepared explicitly for this task nor on particular co-occurrences of toponyms in close proximity in the data to be disambiguated. Some of the resolvers I develop reflect the intuition of many heuristic resolvers that toponyms nearby in text tend to (but do not always) refer to locations nearby on Earth, but do not require toponyms to occur in direct sequence with one another. I also introduce several resolvers that use the predictions of a document geolocation system (i.e. one that predicts a location for a piece of text of arbitrary length) to inform toponym disambiguation. Another resolver takes into account these document-level location predictions, knowledge of different administrative levels (country, state, city, etc.), and predictions from a logistic regression classifier trained on automatically extracted training instances from Wikipedia in a probabilistic way. It takes advantage of all content words in each toponym's context (both local window and whole document) rather than only toponyms. One resolver I build that extracts training material for a machine learned classifier from Wikipedia, taking advantage of link structure and geographic coordinates on articles, resolves 83% of toponyms in a previously introduced corpus of news articles correctly, beating the strong but simplistic population baseline. I introduce a corpus of Civil War related writings not previously used for this task on which the population baseline does poorly; combining a Wikipedia informed resolver with an algorithm that seeks to minimize the geographic scope of all predicted locations in a document achieves 86% blind test set accuracy on this dataset. After providing these high performing resolvers, I form the groundwork for more flexible and complex approaches by transforming the problem of toponym resolution into the traveling purchaser problem, modeling the probability of a location given its toponym's textual context and the geographic distribution of all locations mentioned in a document as two components of an objective function to be minimized. As one solution to this incarnation of the traveling purchaser problem, I simulate properties of ants traveling the globe and disambiguating toponyms. The ants' preferences for various kinds of behavior evolves over time, revealing underlying patterns in the corpora that other disambiguation methods do not account for. I also introduce several automated visualizations of texts that have had their toponyms resolved. Given a resolved corpus, these visualizations summarize the areas of the globe mentioned and allow the user to refer back to specific passages in the text that mention a location of interest. One visualization presented automatically generates a dynamic tour of the corpus, showing changes in the area referred to by the text as it progresses. Such visualizations are an example of a practical application of work in toponym resolution, and could be used by scholars interested in the geographic connections in any collection of text on both broad and fine-grained levels.