Unsupervised learning for large-scale data




Wu, Shanshan, Ph. D.




Unsupervised learning involves inferring the inherent structures or patterns from unlabeled data. Because there is no label information, the fundamental challenge of unsupervised learning is that the objective function is not explicitly defined. The ubiquity of large-scale datasets adds another layer of complexity to the overall learning problem: when the data size or dimension is large, even algorithms with quadratic runtime may be prohibitively expensive. This thesis presents four large-scale unsupervised learning problems. We start with two density estimation problems: given samples from a one-layer ReLU generative model or a discrete pairwise graphical model, the goal is to recover the parameters of the generative model. We then move to representation learning for high-dimensional sparse data that come from one-hot encoded categorical features. We assume that the support of these data has additional structure that is unknown a priori. The goal is to learn a lossless low-dimensional embedding of the given data. Our last problem is to compute low-rank approximations of a matrix product given the individual matrices. We are interested in the setting where the matrices are so large that they can only be stored on disk. For every problem presented in this thesis, we (i) design novel and efficient algorithms that capture the inherent structure of the data in an unsupervised manner; (ii) establish theoretical guarantees and compare empirical performance against state-of-the-art methods; and (iii) provide source code to support our experimental findings.


LCSH Subject Headings