Co-clustering algorithms : extensions and applications
Co-clustering is rather a recent paradigm for unsupervised data analysis, but it has become increasingly popular because of its potential to discover latent local patterns, otherwise unapparent by usual unsupervised algorithms such as k-means. Wide deployment of co-clustering, however, requires addressing a number of practical challenges such as data transformation, cluster initialization, scalability, and so on. Therefore, this thesis focuses on developing sophisticated co-clustering methodologies to maturity and its ultimate goal is to promote co-clustering as an invaluable and indispensable unsupervised analysis tool for varied practical applications. To achieve this goal, we explore the three specific tasks: (1) development of co-clustering algorithms to be functional, adaptable, and scalable (co-clustering algorithms); (2) extension of co-clustering algorithms to incorporate application-specific requirements (extensions); and (3) application of co-clustering algorithms broadly to existing and emerging problems in practical application domains (applications). As for co-clustering algorithms, we develop two fast Minimum Sum-Squared Residue Co-clustering (MSSRCC) algorithms [CDGS04], which simultaneously cluster data points and features via an alternating minimization scheme and generate co-clusters in a “checkerboard” structure. The first captures co-clusters with constant values, while the other discovers co-clusters with coherent “trends” as well as constant values. We note that the proposed algorithms are two special cases (bases 2 and 6 with Euclidean distance, respectively) of the general co-clustering framework, Bregman Co-clustering (BCC) [BDG+07], which contains six Euclidean BCC and six I-divergence BCC algorithms. Then, we substantially enhance the performance of the two MSSRCC algorithms by escaping from poor local minima and resolving the degeneracy problem of generating empty clusters in partitional clustering algorithms through the three specific strategies: (1) data transformation; (2) deterministic spectral initialization; and (3) local search strategy. Concerning co-clustering extensions, we investigate general algorithmic strategies for the general BCC framework, since it is applicable to a large class of distance measures and data types. We first formalize various data transformations for datasets with varied scaling and shifting factors, mathematically justify their effects on the six Euclidean BCC algorithms, and empirically validate the analysis results. We also adapt the local search strategy, initially developed for the two MSSRCC algorithms, to all the twelve BCC algorithms. Moreover, we consider variations of cluster assignments and cluster updates, including greedy vs. non-greedy cluster assignment, online vs. batch cluster update, and so on. Furthermore, in order to provide better scalability and usability, we parallelize all the twelve BCC algorithms, which are capable of co-clustering large-scaled datasets over multiple processors. Regarding co-clustering applications, we extend the functionality of BCC to incorporate application-specific requirements: (1) discovery of inverted patterns, whose goal is to find anti-correlation; (2) discovery of coherent co-clusters from noisy data, whose purpose is to do dimensional reduction and feature selection; and (3) discovery of patterns from time-series data, whose motive is to guarantee critical time-locality. Furthermore, we employ co-clustering to pervasive computing for mobile devices, where the task is to extract latent patterns from usage logs as well as to recognize specific situations of mobile-device users. Finally, we demonstrate the applicability of our proposed algorithms for aforementioned applications through empirical results on various synthetic and real-world datasets. In summary, we present co-clustering algorithms to discover latent local patterns, propose their algorithmic extensions to incorporate specific requirements, and provide their applications to a wide range of practical domains.