On integrating biological sequence analysis with metric distance based database management systems
Biological data analysis can uncover hidden properties of existing experimental results, and guide laboratory experiments to improved efficiency. However, the growing volume of biological data poses a formidable computational challenge for data analysis. Continuous programming efforts must be made to provide efficient solutions for new sequence analysis problems that are constantly being raised. Using metric distance-based indexing methods to manage biological data has the potential to not only provide scalability for solving large-scale data analysis problems, but also simplify the biological discovery process by integrating relational database management systems with the SQL programming model. This dissertation focuses on algorithms and applications of integrating biological sequence analysis with database management systems that store biological data in a distance-based index structure. Three major issues of this integration are addressed in this dissertation: semantic extensions for representing biological sequences in a database management system, metric distance functions between sequence objects, and proximity search algorithms in metric space. Specifically, the sequences are organized as overlapping q-grams so that similarity can be evaluated by weighted Hamming distance parameterized by a metricdistance substitution matrix. Two methods of deriving the metric-distance substitution matrix are developed. And finally an anytime k-nearest search algorithm is developed to provide cost-effective proximity search to overcome the “curse of dimensionality” for searching large datasets. The trade-off between accuracy, speed and scalability of this approach is evaluated by homologous sequence retrieval. The evaluation using a proteomics benchmark shows that the distance-based q-gram retrieval can still maintain accuracy when q is at least as large as 6, which creates a domain of over 60 million key values and enables sufficient scalability to provide effective performance for large disk-resident sequence databases. The broad applicability of integrating the sequence analysis with relational database management systems is illustrated by proposing SQL solutions for several common analysis problems in bioinformatics. The effectiveness of our approach is demonstrated by solving a genome comparison problem: identifying conserved primer pair candidate sites between two plant genomes. The new solution is estimated to be an order of magnitude faster than alternative methods that use a combination of BLAST and script programming.