Mining statistical correlations with applications to software analysis
Machine learning, data mining, and statistical methods work by representing real-world objects in terms of feature sets that best describe them. This thesis addresses problems related to inferring and analyzing correlations among such features. The contributions of this thesis are two-fold: we develop formulations and algorithms for addressing correlation mining problems, and we also provide novel applications of our methods to statistical software analysis domains. We consider problems related to analyzing correlations via unsupervised approaches, as well as algorithms that infer correlations using fully-supervised or semi-supervised information. In the context of correlation analysis, we propose the problem of correlation matrix clustering which employs a k-means style algorithm to group sets of correlations in an unsupervised manner. Fundamental to this algorithm is a measure for comparing correlations called the log-determinant (LogDet) divergence, and a primary contribution of this thesis is that of interpreting and analyzing this measure in the context of information theory and statistics. Additionally based on the LogDet divergence, we present a metric learning problem called Information-Theoretic Metric Learning which uses semi-supervised or fully-supervised data to infer correlations for parametrization of a Mahalanobis distance metric. We also consider the problem of learning Mahalanobis correlation matrices in the presence of high dimensions when the number of pairwise correlations can grow very large. In validating our correlation mining methods, we consider two in-depth and real-world statistical software analysis problems: software error reporting and unit test prioritization. In the context of Clarify, we investigate two types of correlation mining applications: metric learning for nearest neighbor software support, and decision trees for error classification. We show that our metric learning algorithms can learn program-specific similarity models for more accurate nearest neighbor comparisons. In the context of decision tree learning, we address the problem of learning correlations with associated feature costs, in particular, the overhead costs of software instrumentation. As our second application, we present a unit test ordering algorithm which uses clustering and nearest neighbor algorithms, along with a metric learning component, to efficiently search and execute large unit test suites.