Weakly supervised part-of-speech tagging for Chinese using label propagation

dc.contributor.advisor: Baldridge, Jason
dc.contributor.committeeMember: Erk, Katrin
dc.creator: Ding, Weiwei, 1985-
dc.date.accessioned: 2012-02-02T19:39:17Z
dc.date.available: 2012-02-02T19:39:17Z
dc.date.issued: 2011-05
dc.date.submitted: May 2011
dc.date.updated: 2012-02-02T19:39:25Z
dc.description: text
dc.description.abstract: Part-of-speech (POS) tagging is one of the most fundamental tasks in Natural Language Processing. Chinese POS tagging is especially challenging because it also involves word segmentation. This report investigates how to improve unsupervised POS tagging with Hidden Markov Models trained by Expectation Maximization (EM-HMM). The traditional EM-HMM system uses a tag dictionary to constrain the possible tag sequences and to initialize the model parameters. This initialization is very crude: the emission parameters are set uniformly in accordance with the tag dictionary. One way to improve it is to use word alignments, the word-level translation pairs generated from parallel text in two languages; this report uses Chinese-English word alignments. Alignments were expected to help because the two resources are complementary: the dictionary provides information on word types, while word alignments provide information on word tokens. In practice, however, they are found to be of limited benefit. The report therefore proposes another method. To improve dictionary coverage and obtain better POS distributions, it uses Modified Adsorption, a label propagation algorithm. We construct a graph connecting word tokens to feature types (such as word unigrams and bigrams) and connecting those tokens to information from knowledge sources, such as a small tag dictionary, Wiktionary, and word alignments. The core idea is to use a small amount of supervision, in the form of a tag dictionary, to acquire POS distributions for each word (both known and unknown) and to provide these as an improved initialization for EM learning of the HMM. We find that this strategy works very well, especially when the tag dictionary is small. Label propagation provides a better initialization for the EM-HMM method because it greatly increases the coverage of the dictionary. In addition, label propagation is flexible enough to incorporate many kinds of knowledge. However, the results also show that some resources, such as word alignments, are not easily exploited with label propagation.
dc.description.department: Linguistics
dc.format.mimetype: application/pdf
dc.identifier.slug: 2152/ETD-UT-2011-05-3193
dc.identifier.uri: http://hdl.handle.net/2152/ETD-UT-2011-05-3193
dc.language.iso: eng
dc.subject: Chinese part-of-speech tagging
dc.subject: Hidden Markov model
dc.subject: Expectation maximization
dc.subject: Label propagation
dc.title: Weakly supervised part-of-speech tagging for Chinese using label propagation
dc.type.genre: thesis
thesis.degree.department: Linguistics
thesis.degree.discipline: Linguistics
thesis.degree.grantor: University of Texas at Austin
thesis.degree.level: Masters
thesis.degree.name: Master of Arts
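
The abstract describes spreading tag information from a small tag dictionary to uncovered words through a graph of shared features. The Python sketch below illustrates that general idea only. It is not Modified Adsorption and is not taken from the report: it uses a much simpler scheme in which each unlabeled node repeatedly takes the normalized sum of its neighbours' tag distributions while dictionary words stay fixed. The function name, node names, tags, and toy graph are all hypothetical.

```python
from collections import defaultdict


def propagate(edges, seeds, tags, iterations=10):
    """Simplified label propagation (NOT Modified Adsorption).

    edges: iterable of (node, node) pairs, e.g. word-to-feature links.
    seeds: dict mapping a node to the set of tags its dictionary entry allows.
    Returns a dict mapping every node to a {tag: probability} distribution.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = set(neighbours) | set(seeds)

    # Unlabeled nodes start with no mass; seed nodes start uniform over
    # the tags the dictionary allows for them.
    dist = {node: {t: 0.0 for t in tags} for node in nodes}
    for node, allowed in seeds.items():
        dist[node] = {t: (1.0 / len(allowed) if t in allowed else 0.0) for t in tags}

    for _ in range(iterations):
        updated = {}
        for node in nodes:
            if node in seeds:
                updated[node] = dist[node]  # seed nodes stay clamped
                continue
            totals = {t: sum(dist[n][t] for n in neighbours[node]) for t in tags}
            mass = sum(totals.values())
            updated[node] = {t: (v / mass if mass else 0.0) for t, v in totals.items()}
        dist = updated
    return dist


# Hypothetical toy graph: "zazhi" is missing from the dictionary but shares
# a context feature with the known noun "shu".
tags = ["NN", "VV"]
seeds = {"word:shu": {"NN"}, "word:du": {"VV"}}
edges = [
    ("word:shu", "feat:prev=yiben"),
    ("word:zazhi", "feat:prev=yiben"),
    ("word:du", "feat:next=le"),
]
result = propagate(edges, seeds, tags)
print(result["word:zazhi"])
```

In this toy graph the uncovered word receives its distribution only through the feature it shares with a dictionary word. Distributions of this kind could then replace the uniform emission initialization of an EM-trained HMM, which is the role the abstract assigns to label propagation.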

Access full-text files

Original bundle

Name: DING-MASTERS-REPORT.pdf
Size: 1.68 MB
Format: Adobe Portable Document Format
