Expanding the applications of high-throughput DNA sequencing
DNA sequencing is the process of determining the identities of the nucleotides that make up a molecule of DNA. Rapid advances in sequencing technologies in recent years have made it possible to simultaneously determine the sequences of hundreds of millions of short DNA fragments. The ability to perform sequencing with such high throughput has revolutionized the study of biological systems, but the types of questions that can be answered through sequencing-based experiments can be limited by noise and biases in these experiments. One class of applications of high-throughput sequencing involves identifying genetic variation, such as finding rare mutations in the genomes of cancerous cells. In these applications, the sensitivity with which rare genetic variants can be detected is limited by the relatively high rate at which current DNA sequencing technologies incorrectly identify nucleotides. In the first half of this thesis, we present a method for dramatically reducing the rate at which these incorrect identifications occur. Our method, called circle sequencing, creates redundant copies of the sequence of each input molecule of DNA. This is accomplished by circularizing each DNA fragment and performing rolling circle amplification on these circles with a strand-displacing polymerase. Each resulting product consists of several physically linked copies of the original fragment's sequence. When these products are sequenced, this informational redundancy protects against random errors introduced during sequencing, allowing for highly accurate recovery of the original sequence of each input molecule. By eliminating the vast majority of incorrectly identified nucleotides from the resulting data, our method enables the sensitive detection of rare variants and opens up exciting new questions involving such variants to direct measurement by sequencing.
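As a rough illustration of how linked copies can protect against random sequencing errors, the following minimal sketch takes a majority vote across the copies in a single read. It is not the analysis pipeline developed in this thesis: the function name `consensus_from_repeats` is hypothetical, and the sketch assumes that the repeat length is already known, that the read begins at a copy boundary, and that errors are substitutions only (no insertions or deletions).

```python
from collections import Counter


def consensus_from_repeats(read, repeat_length):
    """Majority-vote consensus over tandem copies within one read.

    Simplifying assumptions (not taken from the thesis): the repeat length
    is known, the read starts at a copy boundary, and errors are
    substitutions only.
    """
    # Split the read into complete copies; a trailing partial copy is dropped.
    copies = [
        read[start:start + repeat_length]
        for start in range(0, len(read) - repeat_length + 1, repeat_length)
    ]
    consensus = []
    # Each column holds the base observed at one position in every copy.
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        # Call a base only when a strict majority of copies agree on it.
        consensus.append(base if 2 * count > len(copies) else "N")
    return "".join(consensus)


# Three copies of a 6-nucleotide fragment; the third copy carries one error.
print(consensus_from_repeats("ACGTACACGTACACCTAC", 6))  # prints ACGTAC
```

Under these assumptions, an error that appears in only one copy is outvoted by the others, whereas a true variant in the input molecule is present in every copy and survives the vote.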
An entirely different application of high-throughput sequencing is to selectively capture and sequence stretches of DNA or RNA that are participating in a process of interest within a cell. The accuracy of quantitative inferences made by this type of experiment can be severely impacted, however, by biases introduced during the experimental manipulations used to isolate biologically relevant fragments of DNA or RNA from cells. Ribosome profiling is an experimental technique that consists of sequencing short stretches of messenger RNAs that are protected from nuclease digestion by the presence of a bound ribosome. The resulting data represent millions of snapshots of the locations of actively translating ribosomes. In theory, these snapshots can be used to determine how long ribosomes take to translate each type of codon by quantifying how often ribosomes are observed positioned over that codon. In practice, different studies in yeast attempting to do this have reached contradictory and counterintuitive conclusions. In the second half of this thesis, we perform a large-scale comparative analysis of data from many different ribosome profiling experiments in order to resolve these contradictions. We identify a previously unappreciated source of systematic bias in a subset of these experiments. This bias prevents these experiments from capturing ribosomes in proportion to the time they spend at each position in vivo. Understanding this bias provides insight into the true signatures of translation dynamics in yeast and offers important guidance for the future design and interpretation of sequencing-based approaches to measuring these dynamics.
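The idea of quantifying how often ribosomes are observed over each type of codon can be sketched as follows. This is an illustrative calculation under stated assumptions, not the analysis performed in this thesis: the function name `relative_codon_occupancy` and the input layout are hypothetical, and each footprint is assumed to have already been assigned to a single codon position within its gene.

```python
from collections import defaultdict


def relative_codon_occupancy(genes):
    """Average normalized footprint density for each codon identity.

    `genes` is an iterable of (codons, counts) pairs, where `codons` lists
    the codons of one coding sequence and `counts` gives the number of
    footprints assigned to each of those positions. This layout is an
    assumption made for illustration.
    """
    totals = defaultdict(float)
    occurrences = defaultdict(int)
    for codons, counts in genes:
        mean_count = sum(counts) / len(counts)
        if mean_count == 0:
            continue  # a gene with no footprints carries no information
        for codon, count in zip(codons, counts):
            # Dividing by the gene's mean removes differences in expression
            # level, so that genes can be pooled.
            totals[codon] += count / mean_count
            occurrences[codon] += 1
    return {codon: totals[codon] / occurrences[codon] for codon in totals}


# One toy gene with four codon positions.
example = [(["AAA", "CCC", "AAA", "GGG"], [2, 4, 2, 0])]
print(relative_codon_occupancy(example))
```

A value above 1 for a codon would be read as ribosomes spending longer than average at that codon, but only if the experiment samples ribosomes in proportion to the time they spend at each position.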