Unsupervised partial parsing
MetadataShow full item record
The subject matter of this thesis is the problem of learning to discover grammatical structure from raw text alone, without access to explicit instruction or annotation -- in particular, by a computer or computational process -- in other words, unsupervised parser induction, or simply, unsupervised parsing. This work presents a method for raw text unsupervised parsing that is simple, but nevertheless achieves state-of-the-art results on treebank-based direct evaluation. The approach to unsupervised parsing presented in this dissertation adopts a different way to constrain learned models than has been deployed in previous work. Specifically, I focus on a sub-task of full unsupervised partial parsing called unsupervised partial parsing. In essence, the strategy is to learn to segment a string of tokens into a set of non-overlapping constituents or chunks which may be one or more tokens in length. This strategy has a number of advantages: it is fast and scalable, based on well-understood and extensible natural language processing techniques, and it produces predictions about human language structure which are useful for human language technologies. The models developed for unsupervised partial parsing recover base noun phrases and local constituent structure with high accuracy compared to strong baselines. Finally, these models may be applied in a cascaded fashion for the prediction of full constituent trees: first segmenting a string of tokens into local phrases, then re-segmenting to predict higher-level constituent structure. This simple strategy leads to an unsupervised parsing model which produces state-of-the-art results for constituent parsing of English, German and Chinese. This thesis presents, evaluates and explores these models and strategies.