Applications of large, heterogeneous datasets in understanding and treating pathogenic microbes




DuPai, Cory David




Major advances in a myriad of technologies over the past two decades have led to a remarkable increase in the generation of biological data. In response to this increase, researchers have developed methods to pool and analyze large, heterogeneous datasets for novel insights. Here I do just that, leveraging existing data to expand our understanding of therapeutic proteins and pathogenic microbes. In Chapter 2, I outline major shortcomings in existing viral annotation standards using metadata from all influenza A sequences submitted to the GISAID database between 2005 and 2018. I further establish updated nomenclature standards to improve annotation accuracy moving forward. In Chapter 3, I use published Vibrio cholerae sequencing data to derive a comprehensive gene coexpression network. This network provides direct insights into genes influencing pathogenicity, metabolism, and transcriptional regulation, further clarifies results from previous sequencing experiments in V. cholerae, and expands upon microarray-based findings in related Gram-negative bacteria. In Chapter 4, I systematically probe all 49,000 unique beta hairpin substructures contained within the Protein Data Bank to uncover key characteristics correlated with stable beta hairpin structure, including amino acid biases and enriched inter-strand contacts. I also establish a set of broad design principles that can be applied to the generation of libraries encoding bioactive proteins. These findings highlight the untapped potential, promise, and power of pooled analyses using large, heterogeneous datasets.

