Predicting multibody assembly of proteins
This thesis addresses the multi-body assembly (MBA) problem in the context of protein assemblies. [...] In this thesis, we chose the protein assembly domain because accurate and reliable computational modeling, simulation and prediction of such assemblies would clearly accelerate discoveries in understanding of the complexities of metabolic pathways, identifying the molecular basis for normal health and diseases, and in the designing of new drugs and other therapeutics. [...] [We developed] F²Dock (Fast Fourier Docking) which includes a multi-term function which includes both a statistical thermodynamic approximation of molecular free energy as well as several of knowledge-based terms. Parameters of the scoring model were learned based on a large set of positive/negative examples, and when tested on 176 protein complexes of various types, showed excellent accuracy in ranking correct configurations higher (F² Dock ranks the correcti solution as the top ranked one in 22/176 cases, which is better than other unsupervised prediction software on the same benchmark). Most of the protein-protein interaction scoring terms can be expressed as integrals over the occupied volume, boundary, or a set of discrete points (atom locations), of distance dependent decaying kernels. We developed a dynamic adaptive grid (DAG) data structure which computes smooth surface and volumetric representations of a protein complex in O(m log m) time, where m is the number of atoms assuming that the smallest feature size h is [theta](r[subscript max]) where r[subscript max] is the radius of the largest atom; updates in O(log m) time; and uses O(m)memory. We also developed the dynamic packing grids (DPG) data structure which supports quasi-constant time updates (O(log w)) and spherical neighborhood queries (O(log log w)), where w is the word-size in the RAM. DPG and DAG together results in O(k) time approximation of scoring terms where k << m is the size of the contact region between proteins. [...] [W]e consider the symmetric spherical shell assembly case, where multiple copies of identical proteins tile the surface of a sphere. Though this is a restricted subclass of MBA, it is an important one since it would accelerate development of drugs and antibodies to prevent viruses from forming capsids, which have such spherical symmetry in nature. We proved that it is possible to characterize the space of possible symmetric spherical layouts using a small number of representative local arrangements (called tiles), and their global configurations (tiling). We further show that the tilings, and the mapping of proteins to tilings on arbitrary sized shells is parameterized by 3 discrete parameters and 6 continuous degrees of freedom; and the 3 discrete DOF can be restricted to a constant number of cases if the size of the shell is known (in terms of the number of protein n). We also consider the case where a coarse model of the whole complex of proteins are available. We show that even when such coarse models do not show atomic positions, they can be sufficient to identify a general location for each protein and its neighbors, and thereby restricts the configurational space. We developed an iterative refinement search protocol that leverages such multi-resolution structural data to predict accurate high resolution model of protein complexes, and successfully applied the protocol to model gp120, a protein on the spike of HIV and currently the most feasible target for anti-HIV drug design.