Large-scale clustering: algorithms and applications
Abstract
Clustering is a central problem in unsupervised learning for discovering interesting patterns in the underlying data. Though there have been numerous studies on clustering methods, the focus of this dissertation is on developing efficient clustering algorithms for large-scale applications such as text mining, network analysis, image
segmentation and bioinformatics.
We first present a time and memory efficient technique for the entire process
of text clustering, including the
creation of the vector space model for documents. This efficiency is obtained by (i) a memory-efficient multi-threaded preprocessing scheme, and (ii) a fast
clustering algorithm that fully exploits the sparsity of the data set. We show that this entire process takes time that is linear in the size of the document collection.
Clustering algorithms which are based on heuristics can get trapped in inferior
local optima therefore yielding qualitatively poor results. As the second part of
our work, we propose the use of local search and annealing to improve the quality of the clustering results. In local search, we create a chain of incremental point moves that leads the objective function out of local optima; while the idea of annealing is that we enforce the perturbation of cluster enters after
clusters become stabilized. The effectiveness of these techniques is illustrated in text clustering and
gene expression analysis.
Data in many domains, such as
cluster analysis of the world wide web or circuit partitioning, is represented as graphs. Clustering is often used to find and analyze structural and functional properties of these graphs. In the last part of the dissertation, we present an efficient, high-quality multilevel kernel-based graph
clustering algorithm, which outperforms previous state-of-the-art spectral methods
in quality and runs hundreds or even thousands of times faster. Our multilevel graph clustering algorithm is based on a theoretical connection with the weighted kernel k-means clustering algorithm. We empirically demonstrate that our algorithm is efficient and effective on large social networks, protein interaction networks and
image segmentation.
Department
Description
text