## Statistical consistency of maximum parsimony: A 3-state, 3-taxa model

##### Abstract

Phylogenetics, the study of evolutionary relationships among species, bridges numerous
disciplines, notably mathematics and biology. While biologists and computer scientists might be
more concerned with the net result of phylogenetic methods, i.e. the evolutionary tree depicting
the evolution of species, mathematicians tend to focus on the theory that forms the basis of these
methods. Accordingly, techniques have been developed that make varying assumptions about the
process of evolution. The maximum parsimony method assumes that the correct phylogenetic
tree is the one that requires the fewest changes in genetic sequences as species evolve
over time. This assumption resembles the concept of Ockham’s Razor, that the simplest
explanation is usually the correct one (Semple, 84). In this study, we will examine maximum
parsimony and analyze a particular model to display some properties of the method.

Different phylogenetic methods possess differing statistical properties, often because they
make different assumptions about the way evolution occurs. Most notably, the methods can vary
with respect to statistical consistency, the property that as the size of the sample used to produce
an estimate increases, the estimate converges to the true value. For phylogenetic methods,
the sample size is the length of the gene sequences that are compared. So for a phylogenetic
method to be consistent, the tree it predicts must converge to the actual tree (i.e., the way
evolution actually occurred) as the length of the compared DNA sequences grows. Thus
statistical consistency can help distinguish between methods and determine which
might be the most accurate to use in predicting a tree of life.

In this study we analyze a 3-state, 3-taxa model (three DNA base pairs, three species) using
the maximum parsimony method to determine if maximum parsimony is a consistent
phylogenetic method. The model considers the evolutionary tree given by Felsenstein (403), in which evolution occurs along edges I-V, resulting in species A, B, and C. The values P, Q, and R
indicate the probability of changing from one base pair to another along the corresponding edge.
Intuitively, this change represents a mutation in the DNA sequence that leads to the creation of a new
species. By analyzing maximum parsimony under this model, we find that for certain values of the
change probabilities along the edges, the maximum parsimony method becomes
inconsistent and predicts the incorrect tree.
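
The notion of statistical consistency used here can be illustrated with a small simulation. The sketch below is not the model analyzed in this study. It considers a single edge with a fixed change probability and estimates that probability from sequences of increasing length. The state labels, the function names, and the assumption that a change moves to either of the other two states with equal probability are all hypothetical choices made for illustration. A consistent phylogenetic method would behave analogously, with the predicted tree taking the place of the estimated probability.

```python
import random

# Three hypothetical state labels; the source does not name the states.
STATES = "ACG"


def evolve_site(state, p, rng):
    """Return the state at the end of an edge with change probability p.

    Assumption: when a change occurs, each of the other two states
    is equally likely.
    """
    if rng.random() < p:
        return rng.choice([s for s in STATES if s != state])
    return state


def estimate_change_probability(length, p, rng):
    """Simulate `length` independent sites along one edge and return
    the observed fraction of sites whose state changed."""
    changed = 0
    for _ in range(length):
        start = rng.choice(STATES)
        end = evolve_site(start, p, rng)
        changed += start != end
    return changed / length


def consistency_demo(p=0.2, lengths=(10, 100, 1000, 10000, 100000), seed=0):
    """Map each sequence length to the estimate of p obtained from it."""
    rng = random.Random(seed)
    return {n: estimate_change_probability(n, p, rng) for n in lengths}


if __name__ == "__main__":
    true_p = 0.2
    for n, est in consistency_demo(p=true_p).items():
        print(f"length={n:>6}  estimate={est:.4f}  error={abs(est - true_p):.4f}")
```

As the sequence length grows, the error of the estimate tends to shrink toward zero, which is the behavior a consistent estimator exhibits. An inconsistent method, by contrast, would settle on a wrong answer no matter how long the sequences become.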