Robust methods for locating multiple dense regions in complex datasets
MetadataShow full item record
In classical clustering, each data point is assigned to at least one cluster. However, in many real-world problems, only a small subset of the data clusters well, while the rest shows little or no clustering tendencies. For such situations, this thesis presents several techniques that cluster only a subset of the data into one or more groupings. We first develop a very general parametric approach called Bregman Bubble Clustering that can find multiple dense regions, and can scale to very large datasets. By using a fast iterative relocation based approach combined with a novel concept for improving local search called Pressurization, Bregman Bubble Clustering extends density-based clustering to a much larger set of problems. We also develop a seeding algorithm that can automatically determine the number of clusters, and make the viii results deterministic. We then describe a more focussed non-parametric alternative called Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast, hierarchical, density-based clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select between clusters of different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criteria. The Auto-HDS framework also provides a simple yet powerful 2-D visualization of the hierarchy of clusters that is useful for further exploring the dense clusters in high-dimensional datasets. We also developed a robust, memory efficient, platform independent, and open source Java based implementation of Auto-HDS called Gene DIVER (Gene Density Interactive Visual Explorer) that provides interactive clustering capabilities for high-throughput biological datasets. For problems where finding small dense regions is important, the parametric approach is applicable to a wide variety of scenarios and is scalable to very large datasets. On the other hand, Auto-HDS, the non-parametric approach, provides a powerful visualization, a compact clustering hierarchy, and interactive clustering: properties that are useful for biologists interested in finding and understanding small dense clusters of genes. Together, the two approaches greatly extend the scope of density based clustering in three different dimensions; the diversity of problems that density-based clustering can now be used with, the expanded capability to quickly understand and analyze the clusters in the data, and the scale of the problems that are now within reach of modest computing resources.