Estimating recombination rates from genomic data using topological data analysis




Humphreys, Devon Paul




Accurate estimation of recombination rates from genomic data is critical in studying the origins of evolutionary diversity. Because inferring recombination rates under a full evolutionary model is computationally expensive, an alternative approach using topological data analysis (TDA) has been proposed. Previous TDA methods used only the information contained in the first Betti number (β1) of a sample of genome sequences, a topological feature thought to relate to the number of loops that can be detected within a genealogy with recombination. These methods are considerably less computationally intensive than current model-based methods. However, topological features have proven difficult to connect to the theory of the underlying biological process of recombination, and consequently β1 behaves unpredictably under various evolutionary scenarios involving recombination. We introduce a new topological feature, which we call [symbol], that has a natural connection to coalescent models. We show that [symbol] and β1 respond differently to the evolutionary and empirical scenarios present in a given dataset, so we use them in conjunction to provide a more efficient, robust, and accurate estimator of recombination rates in a topological model, implemented in new software we call TREE. TREE approximates the results of model-based methods on an empirical dataset more closely than previous TDA-based methods do, and it outperforms those methods on both simulated and empirical data. These characteristics make TREE well suited as a first-pass estimator of recombination rate heterogeneity and hotspots throughout the genome. Our work justifies the use of topological statistics as summaries of distributions of genome sequences and describes an unintuitive relationship between topological summaries of genetic distances and the impact of recombination on sequences.
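
To make the role of β1 concrete, the following is a minimal sketch, not the TREE implementation and not the method of any earlier TDA tool. It assumes binary haplotype strings, Hamming distance as the genetic distance, and a Vietoris-Rips complex built at a single fixed scale `eps`, with homology taken over GF(2). The function names (`hamming`, `gf2_rank`, `betti1`) are hypothetical and chosen for illustration only. The four haplotypes 00, 01, 11, 10 form a cycle at distance 1, the kind of loop that β1 counts.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def gf2_rank(vectors):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    pivots = {}  # highest set bit -> stored vector with that leading bit
    rank = 0
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                rank += 1
                break
            v ^= pivots[top]  # eliminate the leading bit and continue
    return rank


def betti1(seqs, eps):
    """First Betti number (GF(2)) of the Vietoris-Rips complex at scale eps.

    Illustrative assumption: vertices are sequences, an edge joins two
    sequences whose Hamming distance is <= eps, and a triangle is filled
    in whenever all three of its edges are present.
    """
    n = len(seqs)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if hamming(seqs[i], seqs[j]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_index
                 and (j, k) in edge_index
                 and (i, k) in edge_index]
    # Boundary of an edge: its two endpoints (bitmask over vertices).
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    # Boundary of a triangle: its three edges (bitmask over edges).
    d2 = [(1 << edge_index[(i, j)])
          | (1 << edge_index[(j, k)])
          | (1 << edge_index[(i, k)])
          for i, j, k in triangles]
    # beta_1 = dim C_1 - rank(d1) - rank(d2)
    return len(edges) - gf2_rank(d1) - gf2_rank(d2)


if __name__ == "__main__":
    four_gametes = ["00", "01", "11", "10"]  # all four two-site haplotypes
    tree_like = ["00", "01", "11"]           # consistent with a tree
    print(betti1(four_gametes, eps=1))
    print(betti1(tree_like, eps=1))
```

At `eps=1` the four-haplotype sample has one loop, while the three-haplotype sample, which is consistent with a tree without recombination, has none. Raising `eps` to 2 connects every pair and fills in the loop, which is why practical TDA methods track features across a range of scales rather than at one fixed value.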

