Supervised language models for temporal resolution of text in absence of explicit temporal cues

Kumar, Abhimanu

dc.contributor.advisor	Ghosh, Joydeep
dc.creator	Kumar, Abhimanu	en
dc.date.accessioned	2014-03-18T20:17:30Z	en
dc.date.issued	2013-12	en
dc.date.submitted	December 2013	en
dc.date.updated	2014-03-18T20:17:30Z	en
dc.description	text	en
dc.description.abstract	This thesis explores the temporal analysis of text using the implicit temporal cues present in a document. We consider the case in which all explicit temporal expressions, such as specific dates or years, are removed from the text and a bag-of-words approach is used to predict its timestamp. A set of gold-standard text documents with timestamps is used as the training set. We also predict time spans for Wikipedia biographies based on their text. Our training texts range from 3800 BC to the present day. We partition this timeline into equal-sized chronons and build a probability histogram for a test document over this chronon sequence. The document is assigned to the chronon with the highest probability. We use two approaches: 1) a generative language model with Bayesian priors, and 2) a KL-divergence-based model. To counter sparsity in the documents and chronons, we use three different smoothing techniques across models. We test our models on three diverse datasets: 1) Wikipedia Biographies, 2) Gutenberg Short Stories, and 3) the Wikipedia Years dataset. Our models are trained on a subset of Wikipedia biographies. We concentrate on two prediction tasks: 1) timestamp prediction for a generic text, or mid-span prediction for a Wikipedia biography, and 2) life-span prediction for a Wikipedia biography. We achieve an f-score of 81.1% on the life-span prediction task and a mean error of around 36 years for mid-span prediction on biographies ranging from 3800 BC to the present day. The best model gives a mean error of 18 years for publication-date prediction on short stories that are uniformly distributed over the range 1700 AD to 2010 AD. Our models exploit the temporal distribution of text to associate it with a time. Our error analysis reveals interesting properties of the models and datasets used. We also combine explicit temporal cues extracted from the document with its implicit cues to obtain a combined prediction model.
We show that a combination of the date-based predictions and language model divergence predictions is highly effective for this task: our best model obtains an f-score of 81.1%, and the median error between actual and predicted life-span midpoints is 6 years. This will be one of the emphases of our future work. The above analyses demonstrate that there are strong temporal cues within texts that can be exploited statistically for temporal predictions. Along the way, we also create good benchmark datasets for the research community to further explore this problem.	en
dc.description.department	Computer Sciences	en
dc.format.mimetype	application/pdf	en
dc.identifier.uri	http://hdl.handle.net/2152/23581	en
dc.subject	Supervised language models	en
dc.subject	Temporal resolution	en
dc.subject	Temporal cues	en
dc.subject	Information retrieval	en
dc.title	Supervised language models for temporal resolution of text in absence of explicit temporal cues	en
dc.type	Thesis	en
thesis.degree.department	Computer Sciences	en
thesis.degree.discipline	Computer Science	en
thesis.degree.grantor	The University of Texas at Austin	en
thesis.degree.level	Masters	en
thesis.degree.name	Master of Science in Computer Sciences	en
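
The chronon-based prediction summarized in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`chronon_index`, `train`, `chronon_histogram`, `predict_year`) are hypothetical, add-one (Laplace) smoothing stands in for the three unspecified smoothing techniques, and a uniform prior over chronons is assumed in place of the Bayesian priors the abstract mentions.

```python
import math
from collections import Counter


def chronon_index(year, start, width):
    """Map a year to the index of the equal-width chronon that contains it."""
    return (year - start) // width


def train(docs, start, end, width, alpha=1.0):
    """Build one smoothed unigram model per chronon.

    docs  : iterable of (tokens, year) pairs with start <= year <= end
    alpha : add-alpha smoothing constant (an assumption, not from the thesis)
    Returns (models, vocab); models[i][w] is log P(w | chronon i).
    """
    n_chronons = (end - start) // width + 1
    counts = [Counter() for _ in range(n_chronons)]
    for tokens, year in docs:
        counts[chronon_index(year, start, width)].update(tokens)
    vocab = set().union(*counts)
    models = []
    for c in counts:
        total = sum(c.values()) + alpha * len(vocab)
        models.append({w: math.log((c[w] + alpha) / total) for w in vocab})
    return models, vocab


def chronon_histogram(tokens, models, vocab):
    """Probability histogram over chronons for one bag-of-words document.

    Uses a uniform prior over chronons (assumed) and ignores unseen words.
    """
    scores = [sum(m[w] for w in tokens if w in vocab) for m in models]
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]


def predict_year(tokens, models, vocab, start, width):
    """Assign the document to the most probable chronon and return its midpoint."""
    hist = chronon_histogram(tokens, models, vocab)
    best = max(range(len(hist)), key=hist.__getitem__)
    return start + best * width + width // 2
```

Under these assumptions, chronons with no training text fall back to a uniform word distribution, which is one concrete form of the sparsity problem that the abstract says smoothing is meant to counter.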

Access full-text files

Original bundle

Name: KUMAR-THESIS-2013.pdf
Size: 447.57 KB
Format: Adobe Portable Document Format

Collections

UT Electronic Theses and Dissertations