Large-scale clustering: algorithms and applications
Clustering is a central problem in unsupervised learning for discovering interesting patterns in the underlying data. Though there have been numerous studies on clustering methods, the focus of this dissertation is on developing efficient clustering algorithms for large-scale applications such as text mining, network analysis, image segmentation and bioinformatics. We first present a time and memory efficient technique for the entire process of text clustering, including the creation of the vector space model for documents. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection. Clustering algorithms which are based on heuristics can get trapped in inferior local optima therefore yielding qualitatively poor results. As the second part of our work, we propose the use of local search and annealing to improve the quality of the clustering results. In local search, we create a chain of incremental point moves that leads the objective function out of local optima; while the idea of annealing is that we enforce the perturbation of cluster enters after clusters become stabilized. The effectiveness of these techniques is illustrated in text clustering and gene expression analysis. Data in many domains, such as cluster analysis of the world wide web or circuit partitioning, is represented as graphs. Clustering is often used to find and analyze structural and functional properties of these graphs. In the last part of the dissertation, we present an efficient, high-quality multilevel kernel-based graph clustering algorithm, which outperforms previous state-of-the-art spectral methods in quality and runs hundreds or even thousands of times faster. Our multilevel graph clustering algorithm is based on a theoretical connection with the weighted kernel k-means clustering algorithm. We empirically demonstrate that our algorithm is efficient and effective on large social networks, protein interaction networks and image segmentation.