An Approach to Information Retrieval Based on Statistical Model Selection
Abstract
Building on previous work in language modeling approaches to information retrieval (IR), this paper proposes a novel approach to document ranking based on statistical model selection. The proposed approach offers two main contributions. First, we posit the notion of a document's "null model," a language model that conditions our assessment of the document model's significance with respect to the query. Second, we introduce an information-theoretic model complexity penalty into document ranking. We rank documents by a penalized log-likelihood ratio comparing the likelihood that each document model generated the query against the likelihood that a corresponding "null" model generated it. Each model is assessed by the Akaike information criterion (AIC), an estimate of the expected Kullback-Leibler divergence between the fitted model (null or non-null) and the underlying model that generated the data. We report experimental results in which the model selection approach improves over traditional LM retrieval.
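As an illustration of the ranking criterion described above, the following is a minimal sketch, not the paper's implementation. It assumes a maximum-likelihood unigram model for each document, the collection-wide unigram model as the "null" model, and the number of distinct terms in a model as its parameter count; all function names and these modeling choices are illustrative assumptions. Since AIC = 2k − 2 ln L, ranking by AIC(null) − AIC(doc) is a complexity-penalized log-likelihood ratio.

```python
import math
from collections import Counter

def unigram_ll(query, model):
    """Log-likelihood of the query terms under a unigram model
    (a dict mapping term -> probability); unseen terms get a tiny floor."""
    return sum(math.log(model.get(t, 1e-12)) for t in query)

def aic(ll, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * ll

def score(query, doc_terms, collection_model):
    """Penalized log-likelihood ratio as an AIC difference:
    AIC(null) - AIC(doc). Higher scores rank higher."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    # Maximum-likelihood document model (an assumed modeling choice).
    doc_model = {t: c / n for t, c in counts.items()}
    # Parameter counts: one free parameter per distinct term (an assumption).
    k_doc = len(counts)
    k_null = len(collection_model)
    ll_doc = unigram_ll(query, doc_model)
    ll_null = unigram_ll(query, collection_model)
    return aic(ll_null, k_null) - aic(ll_doc, k_doc)
```

Because the null model is shared across documents, its AIC term is a per-query constant, so the ranking is driven by each document's penalized fit to the query.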