Algorithms for sparse and structurally constrained discrete optimization problems in bioinformatics and communications
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Inference from high-dimensional noisy data, the task encountered in a wide range of applications including those in wireless communications and bioinformatics, is often computationally challenging. Structural constraints that sometimes restrict the space of possible solutions may adversely affect performance of the existing algorithms. Furthermore, when problem variables come from a finite alphabet, such inference problems are often NP-hard. However, one can exploit auxiliary information about the problem's structure to improve accuracy and reduce computational cost of finding a solution to such inference problems. In the first part of this dissertation, we study the setting where an ℓ₀-norm constraint is imposed on the solution to the integer least-square problem. We propose the sparsity-aware sphere decoding algorithm for finding a near-optimal solution to this problem, analyze its computational complexity for commonly used alphabets, and propose its fast variant. These algorithms can be successfully applied to a variety of settings, as demonstrated in the application to sparse channel estimation. In the second part, we focus on discrete valued data clustering problems, with observations that have non-zero representation on subsets of features, leading to sparsity in the object-to-object similarity space. Clustering objects with such structural constraints is computationally hard; however, one can leverage inherent sparsity to reduce the computational burden of sub-optimal heuristics. We study the problems of Single Individual Haplotyping (SIH) and Quasispecies Reconstruction (QSR), which involve determination of genetic variations causing disease susceptibility in humans and evolutionary fitness in RNA viruses, respectively. These problems require reconstruction of finite alphabet parental sequences using a library of sparse random samples of their mixtures. Our contribution here is two-fold: (i) development of a binary-constrained alternating minimization approach for SIH and providing first theoretical guarantee on the error performance, and (iii) design of an end-to-end clustering framework for QSR which estimates the number, composition and proportions of the distinct viral sequences. Experimental results with synthetic and biological datasets from state-of-the-art libraries demonstrate efficacy of the proposed algorithms. Respective software implementations, called HapAltMin and QSdpR, have been made available for use to bioinformatics community.