# Uncertainty quantification in computations with protein structures and complexes

## Abstract

In this thesis we develop an uncertainty quantification (UQ) framework for computationally represented protein structures. We use probabilistic certificates in the form of Chernoff-like tail bounds to describe the uncertainty in a given quantity of interest (QOI). For an arbitrary functional $f$ computing a quantity on uncertain data $X$, we construct a certificate that, with high probability, the value reported from a single observation differs from the expected value by no more than some $t$: $\Pr\left[\,|f(X) - \mathbb{E}[f(X)]| > t\,\right] \le \epsilon$. We also consider the case where $f$ is an optimization, and provide a certificate on the deviation from the value computed on the optimal input $x^*$: $\Pr\left[\,|f(X) - f(x^*)| > t\,\right] \le \epsilon$.

An important aspect of the framework is that it considers protein structures only in the local neighborhood of a given input, that is, the uncertainty set for that input. For a universe $\mathbb{X}$ and probability measure $\mu$, we define the uncertainty set $U$ as $U = \{X \in \mathbb{X} : \mu(X) > \epsilon\}$. The probability measure $\mu$ depends on the specific protein of interest and must also be tailored to the application. Since the true measure for most proteins and their applications is unknown (and unknowable, as we discuss in this thesis), we provide a mathematical model for a surrogate distribution $\nu$ that is tailored to specific applications and amenable to computation. We call this model the protein uncertainty representation, and we define three such representations, which are used to quantify the uncertainty of QOIs in several different applications.

After establishing the protein uncertainty representation, we present initial certificates for a simplified protein system in which the uncertainty of $f$ can be derived analytically. For generic (and more complex) QOIs, however, we must use quasi-Monte Carlo sampling.
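To make the sampling-based certificate concrete, the following is a minimal sketch and not the thesis implementation. It assumes a Halton sequence as the low-discrepancy point set (the thesis uses its own generator) and a hypothetical toy QOI, `toy_qoi`, in place of a protein computation. The names `radical_inverse`, `halton`, and `empirical_certificate` are introduced here purely for illustration. The sketch estimates $\mathbb{E}[f(X)]$ and the fraction of sampled inputs for which $|f(X) - \mathbb{E}[f(X)]| > t$, which is an empirical estimate of $\epsilon$ rather than a proven bound.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


PRIMES = (2, 3, 5, 7, 11, 13)  # one pairwise-coprime base per dimension


def halton(n_points, dim):
    """First n_points of the Halton sequence in [0, 1)^dim, starting at index 1."""
    if dim > len(PRIMES):
        raise ValueError("add more primes to support this dimension")
    return [
        [radical_inverse(i, PRIMES[d]) for d in range(dim)]
        for i in range(1, n_points + 1)
    ]


def empirical_certificate(f, points, t):
    """Estimate E[f(X)] and the fraction of samples with |f(X) - E[f(X)]| > t."""
    values = [f(x) for x in points]
    mean = sum(values) / len(values)
    exceed = sum(1 for v in values if abs(v - mean) > t)
    return mean, exceed / len(values)


def toy_qoi(x):
    """Hypothetical QOI (an assumption for illustration): the coordinate average."""
    return sum(x) / len(x)


points = halton(4096, 3)
mean, eps_hat = empirical_certificate(toy_qoi, points, t=0.25)
print(f"estimated mean = {mean:.4f}, estimated tail probability = {eps_hat:.4f}")
```

In a real setting the unit-cube points would first be mapped through the surrogate distribution $\nu$ before evaluating $f$; that mapping is omitted here.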
The final ingredient of the uncertainty quantification framework is therefore a sampling protocol that computes $f$ at a set of low-discrepancy points. To construct these points, we implement a pseudorandom number generator that, both theoretically and in practice (as we show in this thesis), generates points with lower discrepancy than other commonly used generators. We use these low-discrepancy points to compute the remaining certificates for a much wider variety of QOIs.

In the remaining chapters, we apply the framework to several applications, showing the kinds and amounts of uncertainty that exist, and showing how these certificates enhance computational results and correlate better with biological findings. The first application provides certificates of uncertainty for simple QOIs, including the surface area and enclosed volume of a protein. Since the input to typical applications computing surface area or volume is a single, high-resolution protein configuration, we consider only the uncertainty arising from small atomic perturbations (given by the temperature factors in the PDB file), represented computationally as a product of 3-dimensional (an)isotropic Gaussian distributions. We show that QOIs computed on samples from this product space can vary substantially from the QOI reported for the initial input structure (especially as the complexity of the QOI increases), but also that the resulting certificates consistently converge to the same value.

Next, we apply the same uncertainty quantification techniques to a more difficult problem: viral capsid assembly from single protein subunits. We represent the conformational space for the assembly problem using both intra-protein uncertainty (arising from hinge flexibility along the backbone) and inter-protein uncertainty (represented by six degrees of freedom between individual subunits).
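The Gaussian perturbation model used for the surface area and volume certificates can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: it handles only the isotropic case, uses the standard relation $B = 8\pi^2 \langle u^2 \rangle$ to convert a temperature factor into a per-coordinate variance, draws ordinary pseudorandom samples rather than low-discrepancy points, and substitutes the radius of gyration for surface area or volume as a stand-in QOI. The coordinates and B-factors are made up, and the function names are hypothetical.

```python
import math
import random


def sigma_from_b_factor(b_factor):
    """Per-coordinate standard deviation (angstroms) from an isotropic B-factor,
    using B = 8 * pi^2 * <u^2>."""
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


def perturb_atoms(atoms, rng):
    """Draw one perturbed copy of a structure.

    atoms is a list of (x, y, z, b_factor) tuples; each coordinate is sampled
    independently from a Gaussian centred on its reported value.
    """
    perturbed = []
    for x, y, z, b_factor in atoms:
        sigma = sigma_from_b_factor(b_factor)
        perturbed.append(
            (rng.gauss(x, sigma), rng.gauss(y, sigma), rng.gauss(z, sigma))
        )
    return perturbed


def radius_of_gyration(coords):
    """Stand-in QOI (an assumption): unweighted radius of gyration."""
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    squared = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    )
    return math.sqrt(squared / n)


# Made-up coordinates (angstroms) and B-factors (square angstroms).
atoms = [
    (0.0, 0.0, 0.0, 15.0),
    (1.5, 0.0, 0.0, 20.0),
    (1.5, 1.5, 0.0, 30.0),
    (0.0, 1.5, 1.5, 45.0),
]

rng = random.Random(0)
reference = radius_of_gyration([a[:3] for a in atoms])
samples = [radius_of_gyration(perturb_atoms(atoms, rng)) for _ in range(2000)]
sample_mean = sum(samples) / len(samples)
print(f"QOI on input structure = {reference:.3f}, sample mean = {sample_mean:.3f}")
```

Comparing `reference` with `sample_mean` mirrors, in miniature, the comparison between the QOI on the input structure and the QOI over the sampled product space.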
We use a Bayesian factor graph to model the chemical equilibrium of assembly transitions, where both associations and dissociations are modeled from the subunit concentrations and the binding free energies. We report certificates on the resulting QOIs under the uncertain inputs, and highlight where our model predicts assembly pathways that differ from those predicted by traditional methods.

The third application expands the uncertainty set to include large-scale protein conformational changes around a given input configuration. We develop a model that uses von Mises-weighted, context-aware Ramachandran distributions to describe the range of possible internal torsion angle assignments. We show the improvement obtained with this full-flexibility model on the unbound protein-protein docking problem, for several input protein structures and several commonly used docking programs. Additionally, we compare the probabilistic certificates computed for the unbound docking case with the actual value (reported for the bound protein), which provides a mechanism for identifying proteins with multiple low-energy wells.

Finally, we apply the same techniques to uncertainty propagation, where uncertainty in one stage of a pipeline propagates into further uncertainty at later stages. Our specific application is toxin interactions with the human voltage-gated sodium (Na$_v$) channel, where we incorporate the uncertainty of both the predicted channel structure and the potentially bound toxins. In this final application, we present preliminary results on the correlation between the certificate-bounded QOIs and biologically known values, and show how the certificates give a clearer view of the relationship between computed and biological values.
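As a rough illustration of the torsion-angle model used in the docking application, the sketch below samples backbone $(\phi, \psi)$ angles from a single von Mises distribution centred on each input angle. This is a deliberate simplification and an assumption made here: the thesis describes von Mises-weighted, context-aware Ramachandran distributions, which this sketch does not implement. The angles, the concentration values `kappa`, and the function names are hypothetical.

```python
import math
import random


def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def sample_torsions(phi_psi_deg, kappa, rng):
    """Sample one (phi, psi) pair per residue around the input angles.

    Each angle is drawn from a von Mises distribution whose mean is the input
    angle; a larger kappa keeps the samples closer to the input.
    """
    sampled = []
    for phi, psi in phi_psi_deg:
        pair = []
        for angle in (phi, psi):
            mu = math.radians(angle) % (2.0 * math.pi)
            draw = rng.vonmisesvariate(mu, kappa)  # radians in [0, 2*pi)
            pair.append(wrap_degrees(math.degrees(draw)))
        sampled.append(tuple(pair))
    return sampled


# Made-up backbone torsion angles in degrees.
backbone = [(-60.0, -45.0), (-120.0, 130.0), (-75.0, 150.0)]

rng = random.Random(1)
tight = sample_torsions(backbone, kappa=1e4, rng=rng)  # near-rigid backbone
loose = sample_torsions(backbone, kappa=5.0, rng=rng)  # flexible backbone
print("tight:", tight)
print("loose:", loose)
```

Varying `kappa` per residue would be one simple way to express that some regions of the backbone are more flexible than others.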