Accelerating the biotechnology revolution with machine learning-guided protein engineering






A central task in biotechnology is engineering proteins by introducing mutations into their sequences, which ultimately alter their folded structure and function. In nature, this process occurs via random mutation and selection, also known as evolution. Protein engineers have learned to limit the randomness and “direct” evolution, but the process remains laborious and bottlenecks the application of biotechnology across all sectors of society. Machine learning (ML)-guided protein engineering has the potential to revolutionize the development of protein-based biotechnology, and enabling this future is the underlying theme of this thesis. Meaningful progress toward ML-guided protein engineering requires both computational advancement and experimental validation. This dissertation presents computational studies that explore the capabilities of ML frameworks applied to protein data, together with experimental validation of structure-based ML frameworks. The first computational study examines the mutational landscape of proteins through the lens of 3D convolutional neural networks (3DCNNs) and evolution. The second study explores how to leverage recent advances in protein large language models (pLLMs) for supervised learning on protein stability. In this study, a supervised dataset that uses organism growth temperatures as a coarse-grained label is curated, and several machine learning techniques developed by the natural language processing and computer vision communities are applied to fine-tune the pLLM ESM-1b to predict changes in thermal stability. On the experimental side, three studies on ML-guided protein engineering are presented. First, we use MutCompute, a 3DCNN, to identify stabilizing mutations on several PET hydrolase scaffolds and demonstrate that the ML-engineered variants provide an avenue for the bioremediation of PET.
Next, we demonstrate the utility of ML-guided protein engineering for the development of pandemic response biotechnology by stabilizing Bst DNA polymerase to enable low-resource COVID-19 diagnostics. The third study is the capstone of this thesis. Here, a structure-based residual neural network (MutComputeX) is trained to generalize to protein-ligand interactions, and an ML pipeline for the computational generation of protein-ligand complexes is developed; the two are then combined to guide the active site engineering of norbelladine 4'-O-methyltransferase, a key enzyme for the biomanufacturing of the FDA-approved drug galantamine. This is the first demonstration of ML-guided active site engineering from a computationally generated protein-ligand-cofactor ternary structure. Overall, these computational advancements and empirical validations of ML-guided protein engineering demonstrate that the future of industrial chemistry is a biological one.


