Learning frameworks utilizing domain knowledge for reconstruction and analysis of biological and communication systems

Ke, Ziqi

Date

2021-08

Authors

Ke, Ziqi

Abstract

In this thesis, we investigate learning frameworks for several problems in bioinformatics and communications. In particular, we present and study auto-encoder architectures for the challenging problems of haplotype and viral quasispecies reconstruction in bioinformatics, modulation/technology classification in communication systems, and reconstruction of biological as well as communication networks. The common thread connecting these subjects is the incorporation of domain-specific knowledge into the design of the learning frameworks.

We begin by presenting GAEseq, the first neural network-based learning framework for the haplotype assembly and viral quasispecies reconstruction problems. Reconstructing the components of a genomic mixture from DNA sequencing data is a challenging problem encountered in a variety of applications, including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome. Assembly of the components, however, requires discovering the origin of each read, an NP-hard problem that existing methods struggle to solve with the required level of accuracy. The proposed algorithm is a neural network that learns to ignore sequencing errors and infers the posterior probabilities of the origin of each sequencing read. Mixture components are then reconstructed by finding the consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques.
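The consensus step described above can be illustrated with a minimal sketch. It assumes that each read has already been given a hard component label (for instance, the component with the highest inferred posterior) and that consensus is taken by per-position majority vote; the function name `consensus_by_component`, the `(start, sequence)` read representation and the gap symbol are hypothetical choices made for illustration and are not taken from the thesis.

```python
from collections import Counter


def consensus_by_component(reads, assignments, num_components, genome_length, gap="-"):
    """Majority-vote consensus per component (illustrative sketch only).

    reads:        list of (start, sequence) pairs, where start is the 0-based
                  position of the read on the reference (hypothetical format).
    assignments:  component index assigned to each read, e.g. the argmax of
                  its inferred origin probabilities.
    Returns one consensus string per component; positions not covered by any
    read of that component are filled with the gap symbol.
    """
    # One base counter per (component, position).
    counts = [[Counter() for _ in range(genome_length)] for _ in range(num_components)]

    for (start, seq), comp in zip(reads, assignments):
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < genome_length:
                counts[comp][pos][base] += 1

    consensus = []
    for comp_counts in counts:
        # Pick the most frequent base at each covered position.
        consensus.append(
            "".join(c.most_common(1)[0][0] if c else gap for c in comp_counts)
        )
    return consensus


# Toy example: four reads assigned to component 0 (one with a sequencing
# error at position 2) and two reads assigned to component 1.
reads = [(0, "ACGT"), (0, "ACGT"), (0, "ACCT"), (2, "GTTA"), (0, "TCGA"), (1, "CGAA")]
assignments = [0, 0, 0, 0, 1, 1]
haplotypes = consensus_by_component(reads, assignments, num_components=2, genome_length=6)
print(haplotypes)
```

In this toy example the erroneous base in the third read is outvoted by the other reads covering the same position, which is the reason a correct read-to-component assignment is the hard part of the problem.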

While capable of providing orders of magnitude higher accuracy than existing schemes, GAEseq is at a disadvantage relative to competing methods in terms of computational complexity. To address this, we develop an alternative learning framework for read clustering that is based on a convolutional auto-encoder. The proposed framework first projects sequenced fragments to a low-dimensional space and then estimates the probability of each read's origin using the learned embedded features. The components are reconstructed by finding consensus sequences that agglomerate reads from the same origin. Mini-batch stochastic gradient descent and dimensionality reduction of the reads allow the proposed method to deal efficiently with massive numbers of long reads. Experiments on simulated, semi-experimental and experimental data demonstrate that the proposed method reconstructs haplotypes and viral quasispecies with accuracy comparable to that of GAEseq while being significantly faster.

We then turn our attention to problems in communications and propose a learning framework for technology/modulation classification. The proposed framework is based on an LSTM denoising auto-encoder designed to automatically extract stable and robust features from noisy radio signals and to infer the modulation or technology type from the learned features. Identifying the type of communication technology and/or modulation scheme from a detected radio signal is a challenging problem encountered in a variety of applications, including spectrum allocation and radio interference mitigation. The problem is made difficult by the growing number of emitter types and the varied effects of real-world channels on the radio signal. Existing spectrum monitoring techniques are capable of acquiring massive amounts of radio and real-time spectrum data using compact sensors deployed in a variety of settings. However, state-of-the-art methods that use such data to classify emitter types and detect communication schemes struggle to achieve the required accuracy at a computational cost that would allow their implementation on low-cost computational platforms. The proposed framework uses a compact neural network architecture that is readily implemented on a low-cost computational platform while exceeding state-of-the-art accuracy. Results on realistic synthetic as well as over-the-air radio data demonstrate that the proposed framework reliably and efficiently classifies received radio signals, often significantly outperforming state-of-the-art techniques.

Finally, we investigate the problem of reconstructing and analyzing networks based on the signals/information being "exchanged" between their nodes. Such tasks are encountered in both communication and biological networks; our focus is primarily on the latter, where we are motivated by the problem of disease transmission. Understanding the transmission dynamics of a virus is of fundamental importance for establishing public health policies and putting an end to a disease outbreak. However, classical methods that rely on epidemiological data such as times of sample collection and exposure intervals struggle to provide the desired insight because such data are of limited informativeness. In particular, the time of sample collection is an unreliable indicator of the time of infection, especially for a disease that may remain asymptomatic long after the infection. Next-generation sequencing technologies enable real-time and accurate reconstruction of viral populations and thus allow the measurement of viral genetic distance between samples. The genetic distance between viral strains present in different hosts contains valuable information about transmission history. Together with the limitations of epidemiological data, this motivates the design of a method capable of detecting disease transmission clusters, reconstructing a directed disease transmission network and identifying super-spreaders in the network from viral genomic data. To this end, we propose a novel end-to-end framework for understanding the transmission dynamics of a virus from viral genomic data. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework outperforms state-of-the-art techniques for this task.
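To make the notion of genetic distance between samples concrete, the following is a minimal baseline sketch and not the end-to-end framework proposed in the thesis. It assumes that each host is represented by a single aligned consensus sequence of equal length, measures distance as Hamming distance, and groups hosts into clusters by linking any pair whose distance is at most a chosen threshold. The function names, the input format and the threshold value are all hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def transmission_clusters(samples, threshold):
    """Group hosts whose viral sequences are within `threshold` mismatches.

    samples: dict mapping a host identifier to its aligned consensus sequence
             (a simplifying assumption; real samples contain viral populations).
    Returns a sorted list of clusters, each a sorted list of host identifiers.
    Hosts are linked transitively using a small union-find structure.
    """
    parent = {host: host for host in samples}

    def find(host):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[host] != host:
            parent[host] = parent[parent[host]]
            host = parent[host]
        return host

    for a, b in combinations(samples, 2):
        if hamming(samples[a], samples[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for host in samples:
        clusters.setdefault(find(host), []).append(host)
    return sorted(sorted(members) for members in clusters.values())


# Toy example with hypothetical host identifiers and sequences.
samples = {
    "h1": "ACGTACGT",
    "h2": "ACGTACGA",
    "h3": "ACGTACAA",
    "h4": "TTTTCCCC",
}
clusters = transmission_clusters(samples, threshold=1)
print(clusters)
```

A threshold-based grouping of this kind yields undirected clusters only; it says nothing about the direction of transmission or about super-spreaders, which are the harder tasks the abstract refers to.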