Adaptive and weighted optimization for efficient and robust learning

Xie, Yuege

Modern machine learning has driven significant breakthroughs in scientific and technological applications and has led to paradigm shifts in optimization and generalization theory. Adaptive and weighted optimization methods have become the workhorses behind today's machine learning applications, but there is still much to learn about why they work in practice and how their efficiency and robustness can be further improved. In this thesis, we first establish the linear convergence of adaptive optimization and then analyze the generalization error of weighted optimization. Building on these theoretical results, we develop efficient and robust learning algorithms to tackle real-world problems such as model sparsification, image classification, and medical image segmentation.

To establish linear convergence guarantees for AdaGrad-Norm, an adaptive gradient descent algorithm, we develop a two-stage analysis framework and show that the convergence is robust to the initial learning rate. Unlike prior work, our analysis does not require knowledge of the smoothness or strong convexity parameters. To understand the generalization of weighted trigonometric interpolation, we derive exact expressions for the generalization error of both plain and weighted least squares estimators. We then show how a bias towards smooth interpolants can lead to smaller generalization errors in the overparameterized regime.
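
As a rough illustration of the kind of update AdaGrad-Norm performs, the sketch below uses the commonly stated form of the algorithm: a single scalar accumulator of squared gradient norms sets one step size for all coordinates. The quadratic test problem, the function name `adagrad_norm`, and the values of `eta`, `b0`, and `steps` are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, steps=2000):
    """Sketch of AdaGrad-Norm: one scalar step size shared by all coordinates.

    grad : callable returning the gradient at x
    eta  : base step size (illustrative default)
    b0   : initial value of the accumulator, i.e. eta / b0 is the
           initial learning rate (illustrative default)
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(steps):
        g = grad(x)
        b_sq += np.dot(g, g)              # accumulate squared gradient norm
        x -= eta / np.sqrt(b_sq) * g      # step with the updated scalar rate
    return x, np.sqrt(b_sq)

# Hypothetical test problem: strongly convex quadratic f(x) = 0.5 * x^T A x,
# whose minimizer is the origin.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

# Two very different initial accumulators (hence initial learning rates).
x_small, b_small = adagrad_norm(grad, [1.0, 1.0], b0=1e-3)
x_large, b_large = adagrad_norm(grad, [1.0, 1.0], b0=1e2)
print(np.linalg.norm(x_small), np.linalg.norm(x_large))
```

On this toy problem both runs approach the minimizer without the smoothness or strong convexity constants being supplied, which is the kind of robustness to the initial learning rate that the analysis addresses. The example is only a numerical illustration and does not reproduce the two-stage argument.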

For efficient sparse model learning, we propose SHRIMP (Sparser Random feature model via Iterative Magnitude Pruning) to adaptively fit high-dimensional data with inherent low-dimensional structure. SHRIMP outperforms other sparse feature models at lower computational complexity, enables feature selection, and is robust to the choice of pruning rate. To further improve the computational efficiency and robustness of AdaGrad-Norm, we propose AdaLoss, an adaptive learning rate schedule that uses only the loss function instead of computing gradient norms. Building on AdaLoss, we enhance data augmentation consistency regularization with an adaptively weighted schedule that uses loss information to handle volumetric medical image segmentation with both sparsely labeled and densely labeled slices. We evaluate our method on CT and MRI scans and demonstrate superior performance over several baselines.
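
To make the idea of a loss-driven learning rate concrete, here is a minimal sketch on a small least squares problem. It assumes the scalar accumulator is incremented by the current loss value rather than by the squared gradient norm. That reading of "uses only the loss function" is an assumption, as are the function name `adaloss_least_squares`, the scaling parameter `alpha`, and the toy data; the exact AdaLoss update in the thesis may differ.

```python
import numpy as np

def adaloss_least_squares(X, y, eta=1.0, b0=1e-2, alpha=1.0, steps=2000):
    """Sketch of a loss-based adaptive step size for 0.5 * ||X w - y||^2.

    Assumption: the accumulator grows by alpha * loss at every step, so no
    gradient norm is needed to set the learning rate.
    """
    w = np.zeros(X.shape[1])
    b_sq = b0 ** 2
    for _ in range(steps):
        r = X @ w - y                     # residual
        loss = 0.5 * np.dot(r, r)         # nonnegative loss value
        b_sq += alpha * loss              # loss-driven accumulator (assumed form)
        w -= eta / np.sqrt(b_sq) * (X.T @ r)
    return w

# Hypothetical consistent system, so the minimum loss is zero.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
w_true = np.array([1.0, -1.0])
y = X @ w_true

w_hat = adaloss_least_squares(X, y)
print(w_hat)
```

The loss is already computed in the forward pass, so a schedule of this kind avoids the extra norm computation over all gradient entries, which is the efficiency argument made above.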