Developing scalable quartet tree encodings
Reconstructing the Tree of Life, the evolutionary history of all species, stands as one of the most significant and intensive problems in computational biology. One approach to this grand project is to use supertree methods that merge a set of smaller trees (or source trees) into one single tree. In practice, most biologists use a particular supertree method called Matrix Representation with Parsimony (MRP) due to its topological accuracy as compared to most other methods. Recently, Snir and Rao presented a new supertree method that first encodes the source trees as a set of four-leaf trees and then uses Quartet Maxcut (QMC) on these quartet trees to compute a single overall tree. On certain realistic model conditions, this supertree method using a particular quartet encoding, Exp + TSQ, was shown to outperform MRP in terms of topological accuracy. However, this supertree method have many limitations. First, it fails to complete on many cases. Second, its subroutine Exp+TSQ is computationally intensive because it examines all possible quartets. These limitations discourage the use of QMC on Exp+TSQ. Thus, we extend the QMC study in the hope of designing a new scalable quartet encoding that would further improve this supertree estimation. Our quartet encodings are based on two ideas: the examination of all possible quartets on large trees is unnecessary, and the taxon sampling density of the source tree should be taken into account in the encoding. We propose an alternative time-efficient and robust encoding UniformK +TSQ* that may be used to substitute for Exp+TSQ without compromising the accuracy of the supertree method.