Algorithm robustness and its applications: Frank-Wolfe and factorized gradient
Abstract
The statistical and computational demands of training modern machine learning models often lead us to run optimization algorithms far beyond the settings and assumptions for which they were designed. For example, scalability and the promise of parallelism often lead us to run algorithms designed for a single machine on asynchronous hardware. For algorithms designed under the assumption of access to pristine gradient information, we often quantize the gradients for computational efficiency, or inject artificial noise or truncate the gradient information to preserve data privacy, build resilience to outliers, and so on. As a consequence, we can view the trajectories these iterative algorithms follow as having perturbations introduced at each step, perturbations that may be persistent (non-zero-mean rather than neutral) and trajectory-dependent. Optimization algorithms that are versatile and succeed in these settings are so because they are resilient, or robust, to these errors or perturbations along the trajectory. In other words, successful algorithms do not accumulate or amplify these errors. We are interested in analyzing and designing optimization algorithms that display this versatility. Though the term is used in many contexts across different fields, here we say that an optimization algorithm enjoys robustness if it exhibits this resilience to perturbations along its trajectories. Quantifying the robustness of certain important algorithms is a main topic of this work.

This thesis has three main contributions. In the first and second parts, we study the robustness of Frank-Wolfe type methods in two important settings: robustness against the perturbation created by asynchronous updates, and robustness to the corruption produced by the injection of adversarially corrupted data into the training set. In the third part of the thesis, we consider the robustness of a gradient descent type method against rank over-specification in matrix sensing. We now give a brief overview of these three contributions.

For the first part of the thesis, we study how Frank-Wolfe (FW) type methods can deal with the perturbation introduced by asynchronous updates. We find that the Stochastic Frank-Wolfe (SFW) method and its variants can run on asynchronous hardware while maintaining the same convergence rate as if running on a single machine. Moreover, with a simple modification of the vanilla SFW algorithm, we can dramatically reduce the communication cost when the constraint set is a nuclear norm ball. When the variable matrix is of dimension D₁ × D₂, we are able to reduce the per-iteration communication cost from 𝒪(D₁D₂), as in most previous work, to 𝒪(D₁ + D₂). We validate the effectiveness of our algorithm on Amazon EC2 and demonstrate that it yields speed-ups almost linear in the number of machines compared to the single-machine setting.
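The communication saving in the first part comes from the structure of the Frank-Wolfe linear minimization oracle (LMO) over the nuclear norm ball: its output is a rank-one matrix, so a worker only needs to send a pair of singular vectors rather than a dense D₁ × D₂ matrix. Below is a minimal sketch of this step in Python; the function names, the use of SciPy's svds, and the step-size handling are illustrative assumptions, not the thesis implementation.

    import numpy as np
    from scipy.sparse.linalg import svds

    def nuclear_ball_lmo(grad, tau):
        # LMO over {S : ||S||_* <= tau}: the minimizer of <grad, S> is
        # -tau * u v^T, where (u, v) is the top singular pair of grad.
        u, _, vt = svds(grad, k=1)
        return u.ravel(), vt.ravel()

    def sfw_step(x, grad_estimate, tau, step):
        # One stochastic Frank-Wolfe step; in a distributed run, only the
        # vectors (u, v), i.e., O(D1 + D2) numbers, would cross the network.
        u, v = nuclear_ball_lmo(grad_estimate, tau)
        s = -tau * np.outer(u, v)            # rank-one FW vertex
        return (1.0 - step) * x + step * s   # convex combination stays feasible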
For the second part of the thesis, we study how Frank-Wolfe (FW) type methods can deal with the perturbation introduced by adversarial data corruption and heavy-tailed data. We find that a few variants of the Frank-Wolfe method are robust to adversarial data corruption or heavy-tailed data for structured statistical estimation problems if paired with proper, standard de-noising techniques. First, we show that two particular variants of the Frank-Wolfe method are stable, i.e., the error in each iteration does not accumulate; we describe these in detail below. Combined with standard techniques that robustify the gradient estimate via multi-dimensional robust mean estimation, we can therefore guarantee robustness for a wide class of problems, but now in a projection-free algorithmic framework (a minimal sketch of this plug-in step is given at the end of this abstract).

Next, we consider high-dimensional problems. Algorithms based on multi-dimensional robust mean estimation may have an unacceptably high sample complexity in the high-dimensional setting. When the constraint set is an ℓ₀ norm ball, Iterative-Hard-Thresholding-based methods have recently been developed to address the sample-complexity challenge, yet extension is non-trivial even for general constraint sets with 𝒪(d) extreme points. For settings where the feasible set has 𝒪(poly(d)) extreme points, we develop a novel robust method, based on a new condition that we call the Robust Atom Selection Condition (RASC). Our method converges linearly up to a corresponding statistical error, with a sample complexity that scales correctly with the sparsity of the problem rather than with the ambient dimension, as would be required by approaches based on multi-dimensional robust mean estimation.

For the third part of the thesis, we study the Factorized Gradient Descent (FGD) method and its performance in low-rank matrix sensing problems when the rank is not correctly specified. Our first observation is that this misspecification can be viewed as introducing a particular perturbation at each point of the algorithm's trajectory, and hence this too falls directly within the scope of our work. Specifically, we consider solving the low-rank matrix sensing problem with FGD when the specified rank is larger than the true rank. If the ground-truth signal X* ∈ ℝᵈˣᵈ is of rank r, but we try to recover it using FFᵀ where F ∈ ℝᵈˣᵏ and k > r, the existing statistical analysis either no longer holds or produces a trivial (infinite) statistical error upper bound. A central challenge in the misspecified setting, and the reason that existing bounds become trivial, comes from the flat local curvature of the loss function around the global minima. Addressing this is a main technical contribution of our work. By decomposing the factorized matrix F into separate column spaces to capture the perturbation introduced by the extra rank, we show that ‖FₜFₜᵀ − X*‖_F² converges sublinearly up to a statistical error of Õ(kdσ²/n) after Õ((σᵣ/σ)√(n/d)) iterations, where Fₜ is the output of FGD after t iterations, σ² is the variance of the observation noise, σᵣ is the r-th largest eigenvalue of X*, and n is the number of samples. With a precise characterization of the convergence rate and the statistical error, our results therefore offer a comprehensive picture of the statistical and computational complexity of solving the rank-over-specified matrix sensing problem with FGD.
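To make the third part's setup concrete, the following is a minimal sketch of FGD for matrix sensing with an over-specified factor width k; the sensing operator, step size, and initialization scale are illustrative assumptions rather than the exact setting analyzed in the thesis.

    import numpy as np

    def fgd_matrix_sensing(A, y, d, k, step=1e-3, iters=500):
        # A: (n, d, d) sensing matrices; y: (n,) noisy measurements y_i = <A_i, X*> + noise.
        # We fit X* of rank r through F F^T with F of width k, possibly k > r.
        rng = np.random.default_rng(0)
        F = 1e-3 * rng.standard_normal((d, k))            # small random initialization
        for _ in range(iters):
            X = F @ F.T
            residual = np.einsum('nij,ij->n', A, X) - y   # <A_i, F F^T> - y_i
            grad_X = np.einsum('n,nij->ij', residual, A) / len(y)
            F = F - step * (grad_X + grad_X.T) @ F        # chain rule through F F^T
        return F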
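Returning to the second part's plug-in robustification step, the sketch below replaces the empirical mean of per-sample gradients with a robust multi-dimensional estimate before the Frank-Wolfe update. The coordinate-wise median is only a stand-in for illustration; the thesis pairs FW with standard robust mean estimators, and the LMO is left abstract.

    import numpy as np

    def robust_gradient(per_sample_grads):
        # per_sample_grads: (n, d) array; a fraction of the rows may be corrupted.
        # Coordinate-wise median as a stand-in for a robust mean estimate.
        return np.median(per_sample_grads, axis=0)

    def robust_fw_step(x, per_sample_grads, lmo, step):
        # One Frank-Wolfe step driven by a robust gradient estimate;
        # lmo(g) returns the minimizer of <g, s> over the constraint set.
        g = robust_gradient(per_sample_grads)
        s = lmo(g)
        return (1.0 - step) * x + step * s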