Estimating species trees from gene trees despite gene tree incongruence under realistic model conditions
Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. With the rapid growth in the number of newly sequenced genomes, species tree inference from multiple genes has become one of the basic and popular tasks in comparative and evolutionary biology. However, combining data from multiple genes is not a trivial task, because genes evolve through biological processes that include deep coalescence (also known as incomplete lineage sorting, or ILS), duplication and loss, and horizontal gene transfer, so that individual gene histories can differ from one another. In this dissertation, we focus on advancing phylogenomic analyses, with particular attention to gene tree discordance. In addition to gene tree discordance, we consider other challenging conditions that frequently arise in genome-scale data. One of these major challenges is incomplete gene trees, meaning that not all gene trees contain individuals from all the species. We performed an extensive simulation study under the multi-species coalescent (MSC) model, which shows that existing methods have poor accuracy when gene trees are incomplete. We formalized the optimal completion problem, which seeks to add the missing taxa (species) to a gene tree with respect to a species tree such that the distance (in terms of ILS) between the gene tree and the species tree is minimized, and we developed an algorithm for solving this problem. We also formalized optimization problems in the context of species tree estimation from a set of incomplete gene trees under the MSC model, and proposed algorithms for solving these problems. We formulated different mathematical models for “gene loss” based on different reasons for incompleteness. Next, we addressed the Minimize Gene Duplication (MGD) problem, which seeks a species tree for a set of gene trees that minimizes the total number of duplications needed to explain the evolutionary history.
We proposed exact and heuristic algorithms to solve this NP-hard problem. Next, we showed in a comprehensive experimental study that existing methods are susceptible to poorly estimated gene trees in the presence of ILS. We proposed a new technique called “binning” that dramatically improves the performance of species tree estimation methods when gene trees are poorly estimated: we first developed “naive binning” and subsequently proposed an improved version, “weighted statistical binning”, to address the problem of gene tree estimation error. Finally, we addressed the computational challenges of reconstructing highly accurate species trees from large-scale genomic data. We developed divide-and-conquer based meta-methods that make existing methods scalable to very large datasets (in terms of the number of species). Overall, this dissertation contributes to understanding the limitations of existing methods under realistic model conditions, developing new approaches to handle the challenging issues that frequently arise in phylogenomics, and improving and scaling existing methods to larger datasets.
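As an illustration of the quantity that the MGD problem minimizes, the following is a minimal sketch of the standard least-common-ancestor (LCA) reconciliation count of duplications for a fixed candidate species tree. It is not the dissertation's exact or heuristic algorithm, which search over species trees; it only scores one given species tree. The nested-tuple tree encoding and the function names (`clade`, `lca_clade`, `count_duplications`, `mgd_score`) are assumptions made for this example.

```python
def clade(tree):
    """Return the set of species labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(clade(child) for child in tree))


def lca_clade(species_tree, taxa):
    """Return the leaf set of the smallest species-tree clade containing taxa."""
    node = species_tree
    while not isinstance(node, str):
        for child in node:
            if taxa <= clade(child):
                node = child  # descend while one child still covers all taxa
                break
        else:
            return clade(node)  # no single child covers taxa: this is the LCA
    return clade(node)


def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species-tree node as a child."""
    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):
            return frozenset([node])  # a gene copy maps to its species leaf
        child_maps = [visit(child) for child in node]
        here = lca_clade(species_tree, frozenset().union(*child_maps))
        if any(mapped == here for mapped in child_maps):
            dups += 1  # LCA-mapping criterion for a duplication node
        return here

    visit(gene_tree)
    return dups


def mgd_score(gene_trees, species_tree):
    """Total duplications over all gene trees for one candidate species tree."""
    return sum(count_duplications(g, species_tree) for g in gene_trees)


# Hypothetical toy data: leaves are species names; a repeated name is a gene copy.
species_tree = (("A", "B"), "C")
gene_trees = [
    (("A", "B"), "C"),                 # agrees with the species tree
    (("A", "C"), "B"),                 # discordant topology
    ((("A", "B"), ("A", "B")), "C"),   # two copies of the (A, B) subtree
]
print(mgd_score(gene_trees, species_tree))
```

In this encoding a species-tree node is identified by its leaf set, which is unambiguous as long as the species tree has no single-child nodes. The sketch recomputes clades on every call for clarity rather than speed.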