Algorithms and heuristics for combinatorial optimization in phylogeny
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The goal of phylogeny is to infer the evolutionary history of the numerous and diverse species on earth. The evolutionary history is usually represented in the form of a phylogenetic tree. Reconstruction of the phylogenetic history of all life on earth is one of the central goals of biology and is also a problem that has many practical applications, ranging from drug design to conservation of biodiversity. My dissertation addresses some of the combinatorial optimization problems that arise in phylogenetic reconstruction and the related field of historical biogeography. Two of the most important methods for reconstructing phylogenies are the optimization problems Maximum parsimony (MP) and Maximum likelihood (ML). Maximum parsimony is a purely combinatorial optimization problem, whereas ML requires an explicit probabilistic model of evolution. Both MP and ML are NP-hard problems, and the best current methods for obtaining good MP and ML solutions are based on some form of local search. Hence, developing better local-search heuristics for ML and MP is a problem of primary importance in phylogeny. Most current local searches are based on the Tree Bisection and Reconnection (TBR) operation for moving from tree to tree. In my dissertation, I consider an alternative local-search move called Edge-Contract-and-Refine (ECR) and explore its theoretical properties in detail. We prove that the neighborhood structure induced by the ECR move on the search space has many attractive properties. These properties suggest it can complement the TBR move in local-searches. We provide fast algorithms for computing the 2-ECR neighbor of a tree with the best MP score during a local search, and for computing a random ECR neighbor. Further, we conduct an extensive comparative study of popular ML heuristics involving both large biological and simulated datasets. We also investigate the impact of the tree used to start the search on the quality of the results obtained. Our results suggest that of the currently available ML software, PHYML, RAxML and GARLI are the three best ML methods. Our results also suggest that the starting tree affects the performance of the search. However, the correlation between the quality of the starting tree and the quality of the final solution is not straightforward. Phylogenetic reconstruction methods like MP and ML often return not a single tree but hundreds, even thousands, of trees with the same optimality score. Consensus phylogenetic trees are often used to summarize the common features observed in all or most of the resultant trees. Maximum Compatible Subtree (MCT) is one such consensus tree. In my dissertation, I provide a fixed-parameter tractable algorithm for the MCT problem when the input trees have a bounded degree. Further, I provide fast polynomial time approximation algorithms for the complement of the MCT problem. The goal of historical biogeography is to infer the history of the geographic distribution of organisms in the light of their evolutionary history. One of the most important tools in historical biogeographical inference is the comparison of phylogenies of different groups of species that share their geographic distribution. Common patterns observed among these different phylogenies suggest a shared geographic history. So that they can be compared, the phylogenies are converted to area cladograms, where the taxa are the geographic locations of organisms. Until very recently, area cladograms have been compared only visually. In my dissertation, we formalize the problem of comparing area cladograms in the form of the Maximum Agreement Area Cladogram (MAAC) problem, and provide a fast polynomial-time algorithm for computing the MAAC of two area cladograms. Further, we develop distance metrics for the comparison of area cladograms.