Word meaning in context as a paraphrase distribution : evidence, learning, and inference
MetadataShow full item record
In this dissertation, we introduce a graph-based model of instance-based, usage meaning that is cast as a problem of probabilistic inference. The main aim of this model is to provide a flexible platform that can be used to explore multiple hypotheses about usage meaning computation. Our model takes up and extends the proposals of Erk and Pado  and McCarthy and Navigli  by representing usage meaning as a probability distribution over potential paraphrases. We use undirected graphical models to infer this probability distribution for every content word in a given sentence. Graphical models represent complex probability distributions through a graph. In the graph, nodes stand for random variables, and edges stand for direct probabilistic interactions between them. The lack of edges between any two variables reflect independence assumptions. In our model, we represent each content word of the sentence through two adjacent nodes: the observed node represents the surface form of the word itself, and the hidden node represents its usage meaning. The distribution over values that we infer for the hidden node is a paraphrase distribution for the observed word. To encode the fact that lexical semantic information is exchanged between syntactic neighbors, the graph contains edges that mirror the dependency graph for the sentence. Further knowledge sources that influence the hidden nodes are represented through additional edges that, for example, connect to document topic. The integration of adjacent knowledge sources is accomplished in a standard way by multiplying factors and marginalizing over variables. Evaluating on a paraphrasing task, we find that our model outperforms the current state-of-the-art usage vector model [Thater et al., 2010] on all parts of speech except verbs, where the previous model wins by a small margin. But our main focus is not on the numbers but on the fact that our model is flexible enough to encode different hypotheses about usage meaning computation. In particular, we concentrate on five questions (with minor variants): - Nonlocal syntactic context: Existing usage vector models only use a word's direct syntactic neighbors for disambiguation or inferring some other meaning representation. Would it help to have contextual information instead "flow" along the entire dependency graph, each word's inferred meaning relying on the paraphrase distribution of its neighbors? - Influence of collocational information: In some cases, it is intuitively plausible to use the selectional preference of a neighboring word towards the target to determine its meaning in context. How does modeling selectional preferences into the model affect performance? - Non-syntactic bag-of-words context: To what extent can non-syntactic information in the form of bag-of-words context help in inferring meaning? - Effects of parametrization: We experiment with two transformations of MLE. One interpolates various MLEs and another transforms it by exponentiating pointwise mutual information. Which performs better? - Type of hidden nodes: Our model posits a tier of hidden nodes immediately adjacent the surface tier of observed words to capture dynamic usage meaning. We examine the model based on by varying the hidden nodes such that in one the nodes have actual words as values and in the other the nodes have nameless indexes as values. The former has the benefit of interpretability while the latter allows more standard parameter estimation. Portions of this dissertation are derived from joint work between the author and Katrin Erk [submitted].