Benchmarking of single nucleotide somatic variant calling

Derryberry, Dakota Zipporah

Date

2017-08

Authors

Derryberry, Dakota Zipporah

Abstract

Cancer, which affects hundreds of thousands of people worldwide every year and costs billions in treatment, is a disease caused by mutations that arise in somatic cell lines and contribute to abnormal and pathological behavior and growth in cells. These mutations are called somatic variants, and there are several types. The simplest are single-nucleotide somatic variants, in which a patient’s tumor DNA differs from their normal DNA at only a single base pair.

To better treat and understand cancer, clinicians and researchers, respectively, seek to identify and locate cancer-relevant mutations. The low cost and high throughput of next-generation sequencing have made it the preferred platform for somatic variant discovery and identification over the last five years. Despite its widespread adoption, much remains unknown about the reliability of this method. Benchmarking somatic variant calling pipelines, the topic of this thesis, attempts to fill this gap by quantifying the quality of the variant calling process in terms of the accuracy, precision, and reproducibility of its results.

In chapter one of this thesis, I present a review of current methods and benchmarking of single-nucleotide somatic variant calling. I begin with an overview of the variant calling process, from raw reads to high quality variant calls. Next, I discuss what is known about the quality of results produced by the computational variant discovery pipeline. Finally, I present open questions and possible areas of future research.

In chapter two, I present original research on the filtering step at the end of the single-nucleotide somatic variant calling pipeline, which attempts to distinguish real somatic variant calls from errors. Using multiple sequencing runs from the same tumors, with concordance between runs as a measure of accuracy, I show that filters based on alignment features are the most effective at removing errors while keeping true variants.
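
The idea of scoring a filter by concordance between repeated runs can be illustrated with a minimal sketch. This is not the thesis's actual method or code. It assumes concordance is measured as the Jaccard index of two call sets, and it uses a hypothetical per-call feature `mapq` (mean mapping quality) with an arbitrary threshold of 30 as a stand-in for an alignment-based filter. All names and data below are invented for illustration.

```python
from typing import Callable, Dict, Tuple

# A variant is keyed by (chromosome, position, reference base, alternate base).
Variant = Tuple[str, int, str, str]
# Each call carries hypothetical per-call features, e.g. mean mapping quality.
Features = Dict[str, float]
CallSet = Dict[Variant, Features]


def concordance(run_a: CallSet, run_b: CallSet) -> float:
    """Jaccard index: calls found in both runs divided by calls found in either."""
    union = run_a.keys() | run_b.keys()
    if not union:
        return 0.0
    return len(run_a.keys() & run_b.keys()) / len(union)


def apply_filter(calls: CallSet, keep: Callable[[Features], bool]) -> CallSet:
    """Return only the calls whose features pass the predicate `keep`."""
    return {variant: feats for variant, feats in calls.items() if keep(feats)}


def high_mapq(feats: Features) -> bool:
    """Hypothetical alignment-based filter; the threshold is an assumption."""
    return feats["mapq"] >= 30.0


# Toy call sets from two sequencing runs of the same tumor (invented data).
run_1: CallSet = {
    ("chr1", 100, "A", "T"): {"mapq": 60.0},
    ("chr1", 250, "G", "C"): {"mapq": 12.0},
    ("chr2", 500, "C", "T"): {"mapq": 58.0},
}
run_2: CallSet = {
    ("chr1", 100, "A", "T"): {"mapq": 59.0},
    ("chr2", 500, "C", "T"): {"mapq": 60.0},
    ("chr3", 42, "T", "G"): {"mapq": 9.0},
}

before = concordance(run_1, run_2)
after = concordance(apply_filter(run_1, high_mapq), apply_filter(run_2, high_mapq))
shared_kept = len(apply_filter(run_1, high_mapq).keys() & apply_filter(run_2, high_mapq).keys())

print(f"concordance before filtering: {before:.2f}")
print(f"concordance after filtering:  {after:.2f}")
print(f"shared calls retained:        {shared_kept}")
```

In this toy data, the filter removes the two calls that appear in only one run while retaining both shared calls, so concordance rises. A filter that raised concordance only by discarding shared calls as well would be less useful, which is why the number of retained shared calls is reported alongside the concordance score.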