A systems approach to computational protein identification
Proteomics is the study of the dynamic protein content of an organism's cells (its proteome), and characterizing that content is one of the largest current challenges in biology. Computational proteomics is an active research area concerned with in silico methods for analyzing high-throughput protein identification data. Current methods are based on tandem mass spectrometry (MS/MS) and suffer from low coverage and accuracy, reliably identifying only 20-40% of the proteome. This dissertation addresses the recall, precision, speed, and scalability of computational proteomics experiments.
This research goes beyond the traditional paradigm of analyzing MS/MS experiments in isolation, instead learning priors of protein presence from the joint analysis of various systems biology data sources. Two new methods demonstrate the effectiveness of this integrative "systems" approach to protein identification. The first, MSNet, introduces a social model for protein identification and leverages functional dependencies from genome-scale, probabilistic gene functional networks. The second, MSPresso, learns a gene expression prior from a joint analysis of mRNA and proteomics experiments on similar samples.
These two sources of prior information yield more accurate estimates of protein presence and increase protein recall by as much as 30% in complex samples while also increasing precision. A comprehensive suite of benchmarking datasets is introduced for evaluation in yeast. Methods to assess statistical significance in the absence of ground truth are also introduced and employed whenever applicable.
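To make the role of a prior concrete, the following is a minimal sketch of how a prior probability of protein presence could be combined with MS/MS evidence through Bayes' rule in odds form. It is not the MSNet or MSPresso model: the function name `posterior_presence`, the use of a single likelihood ratio to summarize the MS/MS evidence, and the example numbers are all assumptions made for illustration.

```python
def posterior_presence(prior: float, likelihood_ratio: float) -> float:
    """Posterior probability that a protein is present.

    prior            -- assumed prior P(present), e.g. learned from mRNA data
    likelihood_ratio -- assumed P(evidence | present) / P(evidence | absent)
                        summarizing the MS/MS evidence (hypothetical input)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    # Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio.
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)


# The same weak MS/MS evidence under two different (made-up) priors.
weak_evidence = 2.0
high_prior = posterior_presence(0.8, weak_evidence)  # e.g. abundant mRNA
low_prior = posterior_presence(0.2, weak_evidence)   # e.g. little or no mRNA
```

Under these assumptions, identical spectral evidence leads to a higher posterior for the protein whose prior is higher, which is the mechanism by which an informative prior can recover proteins that fall below a fixed evidence threshold.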
This dissertation also describes a database indexing solution that improves the speed and scalability of protein identification experiments. The method, MSFound, customizes a metric-space database index and its associated approximate k-nearest-neighbor search algorithm with a semi-metric distance designed to match noisy spectra. MSFound achieves an order-of-magnitude speedup over traditional spectra database searches while maintaining scalability.
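For orientation, the search problem can be sketched as an exhaustive k-nearest-neighbor scan over a spectral library. This is the brute-force baseline that an index aims to avoid, not the MSFound index or its approximate search. The abstract does not specify MSFound's distance, so cosine distance is used here only as a familiar stand-in that does not obey the triangle inequality. The binned-spectrum representation and the names `cosine_distance` and `k_nearest` are likewise assumptions.

```python
import heapq
import math
from typing import Dict, List, Tuple

# Hypothetical representation: binned m/z value -> peak intensity.
Spectrum = Dict[int, float]


def cosine_distance(a: Spectrum, b: Spectrum) -> float:
    """1 - cosine similarity; symmetric and non-negative, but it does not
    satisfy the triangle inequality (a stand-in distance, not MSFound's)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty spectrum as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def k_nearest(query: Spectrum, library: Dict[str, Spectrum],
              k: int) -> List[Tuple[float, str]]:
    """Exhaustive scan: compare the query against every library spectrum
    and return the k closest as (distance, name) pairs, nearest first."""
    scored = ((cosine_distance(query, s), name) for name, s in library.items())
    return heapq.nsmallest(k, scored)


# Made-up toy library and query.
library = {
    "pepA": {100: 1.0, 200: 2.0},
    "pepB": {300: 1.0},
    "pepC": {100: 1.0, 200: 1.9, 300: 0.1},
}
query = {100: 1.0, 200: 2.0}
hits = k_nearest(query, library, k=2)
```

The exhaustive scan costs one distance computation per library spectrum for every query, which is the cost that a metric-space index with approximate search is meant to reduce.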