Matrix theory : optimization, concentration, and algorithms
The matrix plays an essential role in many theoretical computer science and machine learning problems. In this thesis, we develop a better understanding of matrices with a view towards these applications. Our insights yield improvements for a number of old, well-studied algorithmic problems. In this thesis, we study matrices from three perspectives. We first consider their role in optimization. We study a number of matrix optimization problems, and propose new solvers and results for linear programs, empirical risk minimization, ordinary differential equations, deep neural networks. We next consider how random matrices concentrate. Specifically, we generalize a number of scalar Chernoff-type concentration inequalities and the Spencer-type discrepancy theorems to matrices. Finally, we develop new algorithms for problems on matrices. These fall roughly into two sub-categories, namely matrix factorization problems and structured recovery problems. In the first category, we propose a number of new algorithms for a variety of low-rank matrix factorization problems. In the second category, we give new algorithms for some recovery tasks with structured matrices. We design matrices and corresponding algorithms for compressed sensing tasks, and we give fast algorithms for the sparse Fourier transform problem, which can be thought of as a sparse recovery problem where one cannot freely choose the matrix. We now describe our contributions in more detail. Linear programming is one of the fundamental problems in computer science. In both theory and practice, many problems can be solved via linear programming, and doing so often yields the best-known runtime. We present an algorithm that runs in the current matrix multiplication time, which breaks a thirty-year-old barrier. Furthermore, our technique can be generalized to speed up a large family of convex optimization problems, i.e., empirical risk minimization. Sampling logconcave functions which arise in statistics and machine learning has been the subject of intensive study. This problem can be thought of as a special case of solving multivariate ordinary differential equations (ODEs). Under sufficiently strong smoothness conditions, discrete algorithms of Hamiltonian Monte Carlo (HMC) have runtime and number of function evaluations growing with the dimension. We give new algorithms that obtain a nearly linear implementation of HMC for a broad class of smooth, strongly logconcave densities, with the number of iterations (parallel depth) and gradient evaluations being polylogarithmic in the dimension (rather than polynomial, as in previous work). Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. We prove why stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. Furthermore, our results also hold for Recurrent Neural Networks (RNNs) which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the same recurrent unit is repeatedly applied across the entire time horizon. The Chernoff bound for the concentration of scalar random variables is a fundamental tool in the analysis of randomized algorithms. Over the past decade, a matrix generalization of the Chernoff bound has also found widespread application, but this generalization is fairly restrictive. For a number of decades, it has been open whether one can remove some of these restrictions. We answer this question in the affirmative by giving a number of new matrix Chernoff bounds under more relaxed independence assumptions than before. Spencer's theorem is a famous result in discrepancy theory, and it is an important open question how to generalize this result to the matrix setting. We make progress on this, and prove a matrix generalization of this result, in some restricted settings. Our result also generalizes the famous Kadison-Singer conjecture. The classical low rank approximation problem is : given a matrix A, find a rank-k matrix B such that the Frobenius norm of A - B is minimized. It can be solved in polynomial-time using, for instance, singular value decomposition. If one allows randomization and approximation, it can be solved in time proportional to the number of non-zero entries of A with high probability. We consider a number of natural generalizations of this important problem. We generalize the problem to consider a number of different norms, including weighted norms, entry-wise L1, and more general loss functions including Huber and Tukey loss. We also consider the natural tensor version of the problem. For all these settings, we give new state-of-the-art algorithms, including a number of new fixed parameter tractable algorithms. Compressed Sensing, or sparse recovery, is a powerful mathematical framework whose goal is to reconstruct an approximately sparse signal from linear measurements. We consider the extensively studied problem of L2/L2 compressed sensing. Our algorithm has faster decoding time and significantly smaller column sparsity, answering two open questions of prior work. Previous work on sublinear-time compressed sensing employed an iterative procedure, recovering the heavy coordinates in phases. We completely depart from that framework, and give the first sublinear-time algorithm which achieves the optimal number of measurements without iterating. The Fourier transform is ubiquitous in computer science, mathematics, and beyond. In recent years, a number of works have studied methods for computing the Fourier transform in sublinear time if the output is sparse. We consider two important settings in which this occurs: namely, the sparse high-dimensional setting, and the sparse continuous setting. In the high dimensional setting, we consider the extensively studied problem of computing a k-sparse approximation to the d-dimensional Fourier transform of a length n signal. Our algorithm achieves a nearly optimal sample complexity and runs in time comparable to the Fast Fourier Transform. All previous algorithms proceed either via the Restricted Isometry Property or via filter functions. Our approach totally departs from the aforementioned techniques, and we believe is a new approach to the sparse Fourier transform problem. While many of the works on sparse Fourier transforms have focused on the discrete setting, in many applications the input signal is continuous and naive discretization significantly worsens the sparsity level. Assuming the frequency gap is known, we present an algorithm for robustly computing sparse Fourier transforms in the continuous setting. Knowing the frequency gap is necessary for robustly identifying individual frequencies. However, was unknown whether interpolating the signal also requires a frequency gap. We resolve this problem by giving an algorithm which shows that such a gap not necessary for estimating the signal.