Scalable clustering algorithms

dc.contributor.advisor: Ghosh, Joydeep
dc.creator: Banerjee, Arindam
dc.date.accessioned: 2008-08-28T22:18:42Z
dc.date.available: 2008-08-28T22:18:42Z
dc.date.issued: 2005
dc.description.abstract: Scalable clustering algorithms that can work with a wide variety of distance measures and also incorporate application-specific requirements are critically important for modern-day data analysis and predictive modeling. In this thesis, we propose and analyze a large class of such algorithms, evaluate their performance on benchmark datasets, and investigate theoretical connections of the proposed algorithms to lossy compression and stochastic prediction. First, a wide variety of popular centroid-based clustering algorithms are unified using a large class of distance measures known as Bregman divergences. We present both hard- and soft-clustering algorithms using Bregman divergences. By establishing a bijection between regular exponential family distributions and regular Bregman divergences, we note that Bregman soft clustering algorithms are equivalent to learning mixtures of exponential family distributions, but can be computationally more efficient in practice. We also design algorithms for clustering directional data that generate balanced clusters, i.e., clusters of comparable sizes, a desirable property in certain practical applications. Experimental results show that such algorithms perform well for high-dimensional problems such as text clustering. A general framework for scaling up balanced clustering algorithms is then proposed. The framework is applicable to all the algorithms presented in this thesis as well as a wide variety of other algorithms. Extensive experimental results on benchmark datasets are provided to establish the efficacy of the proposed framework. Further, we propose a new method for evaluation and model selection for clustering that can be applied to practically any clustering algorithm. The method is applicable in a transductive setting and measures the predictive accuracy of a clustering algorithm. A detailed analysis of the connections of rate distortion theory to the proposed clustering algorithms, in particular the Bregman clustering algorithms, is also presented. In the process, we establish some key theoretical results in rate distortion theory for Bregman divergences, special cases of which have been studied in the literature using squared Euclidean distance. Also, we generalize a widely known result in stochastic prediction by establishing that the conditional expectation is the optimal predictor of a random variable if and only if the prediction error is measured by a Bregman divergence. This result explains the fundamental reason behind the efficiency of the Bregman clustering algorithms.
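The Bregman hard-clustering scheme summarized in the abstract follows the k-means template: points are assigned to the centroid that minimizes the chosen Bregman divergence, and each centroid is then updated to the arithmetic mean of its cluster, which is optimal for any Bregman divergence. The following is a minimal illustrative sketch in Python/NumPy, not code from the thesis; the function names and the two example divergences (squared Euclidean and a KL-style divergence for probability vectors) are assumptions made for illustration.

    import numpy as np

    def squared_euclidean(x, mu):
        # Bregman divergence generated by phi(x) = ||x||^2 (recovers k-means)
        return np.sum((x - mu) ** 2, axis=-1)

    def kl_divergence(x, mu, eps=1e-12):
        # Bregman divergence generated by negative entropy, for probability vectors
        # (the linear correction terms cancel when both x and mu sum to one)
        return np.sum(x * np.log((x + eps) / (mu + eps)), axis=-1)

    def bregman_hard_clustering(X, k, divergence, n_iters=100, seed=0):
        """Generic Bregman hard clustering: assign each point to the centroid
        with the smallest divergence, then update each centroid as the
        arithmetic mean of its cluster."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # Assignment step: divergence of every point to every centroid
            dists = np.stack([divergence(X, mu) for mu in centroids], axis=1)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break  # assignments unchanged, so the algorithm has converged
            labels = new_labels
            # Update step: the mean minimizes total Bregman divergence
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return labels, centroids

For example, bregman_hard_clustering(X, k=3, divergence=squared_euclidean) behaves like standard k-means, while passing kl_divergence clusters probability vectors (e.g., normalized term-frequency vectors) under an information-theoretic distortion.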
dc.description.department: Electrical and Computer Engineering
dc.format.medium: electronic
dc.identifier: b60167610
dc.identifier.oclc: 62260116
dc.identifier.proqst: 3187659
dc.identifier.uri: http://hdl.handle.net/2152/1818
dc.language.iso: eng
dc.rights: Copyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author who has retained all copyrights in the works.
dc.subject.lcsh: Cluster analysis
dc.subject.lcsh: Computer algorithms
dc.title: Scalable clustering algorithms
dc.type.genre: Thesis
thesis.degree.department: Electrical and Computer Engineering
thesis.degree.discipline: Electrical and Computer Engineering
thesis.degree.grantor: The University of Texas at Austin
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy

Access full-text files

Original bundle

Name: banerjeea16458.pdf
Size: 1.46 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.65 KB
Format: Plain Text