Browsing by Subject "Data mining"
Now showing 1 - 20 of 33
Item: A hybrid reduced approach to handle missing values in type 2 diabetes prediction (2016-05-06)
You, Xinqi; Saar-Tsechansky, Maytal; Gawande, Kishore
Diabetes is gaining attention among medical institutions and health care organizations as its prevalence increases around the world. In the United States, 29.1 million people, or 9.3% of the U.S. population, are diagnosed with diabetes. About 86 million people are categorized as pre-diabetic, and 15-30% of them will develop diabetes within 5 years. To tackle this challenge, the National Diabetes Prevention Program (DPP) was introduced in 2002; it reduces the risk of diabetes by 58% through a lifestyle change program. To help select a better group of pre-diabetic patients for intervention and maximize the cost-effectiveness of the program, we propose a Hybrid Reduced approach to handle missing values when predicting type 2 diabetes. This approach deals with four challenges in electronic medical records: missing values, data missing not at random, class imbalance, and prediction over a longer (2-year) window. We select three ensemble predictive models (AdaBoost.M1, Gradient Boosting, and Extremely Randomized Trees) and apply this approach across 7 years to assess its robustness. The Hybrid Reduced approach includes two sub-approaches: Hybrid Reduced Organic and Hybrid Reduced Imputed. Throughout the experiments, Hybrid Reduced Imputed is the best performer and achieves a 5-7% improvement in precision. By simply using this approach, we could save $278 million in healthcare costs and improve people's health conditions.

Item: A multi-scale framework for graph based machine learning problems (2017-05)
Shin, Donghyuk; Dhillon, Inderjit S.; Whinston, Andrew B.; Qiu, Lili; Chakrabarti, Deepayan
Graph data have become essential in representing and modeling relationships between entities and complex network structures in various domains such as social networks and recommender systems. As a main contributor to the recent Big Data trend, the massive scale of graphs in modern machine learning problems easily overwhelms existing methods, and thus sophisticated, scalable algorithms are needed for real-world applications. In this thesis, we develop a novel multi-scale framework based on the divide-and-conquer principle as an effective and scalable approach for machine learning tasks involving large sparse graphs. We first demonstrate how our multi-scale framework can be applied to the problem of computing the spectral decomposition of massive graphs, which is one of the most fundamental low-rank matrix approximations used in numerous machine learning tasks. While popular solvers suffer from slow convergence, especially when the desired rank is large, our method exploits the clustering structure of the graph and achieves superior performance compared to existing algorithms in terms of both accuracy and scalability. While the main goal of the divide-and-conquer approach is to efficiently compute solutions for the original problem, the proposed multi-scale framework further admits an attractive but less obvious feature that machine learning problems can benefit from. In particular, we consider partial solutions of the subproblems computed in the process as localized models of the entire problem. By doing so, we can combine models at multiple scales, from local to global, and generate a holistic view of the underlying problem to achieve better performance than a single global view. We adapt this multi-scale view to the problems of link prediction in social networks and collaborative filtering in recommender systems with additional side information to obtain a model that can make accurate and robust predictions in a scalable manner.

Item: Celebrities of the digital era : conceptualization of social media celebrities as brand endorsers (2018-05-07)
Choi, Jin-A; Lee, Wei-Na, 1957-; Wilcox, Gary B.; Atkinson, Lucy; Whittaker, Tiffany
The digital era is full of enhanced opportunities for advertising. A new phenomenon has risen as an effective marketing communication strategy: the rise in fame and status of ordinary social media users to "social media celebrities" (SMCs). Despite the interest and demand, there is limited academic scholarship and understanding of SMCs and their potential influence as brand endorsers. Therefore, a series of in-depth interviews was carried out to conceptualize the phenomenon of SMCs. Findings suggest that SMCs are perceived as authentic, have transparent motives, enable consumers with goal-oriented outcomes, and have exclusive recognition and acceptance within a niche and segmented group. Additionally, a definition of SMCs is offered. The findings were triangulated with a secondary set of in-depth interviews, which corroborated the findings. Finally, data mining and textual analysis of 135,754 YouTube comments from SMC endorsement videos using SAS Text Miner 12.1 provide a large-scale, quantitative validation of the findings.

Item: Classification of encrypted cloud computing service traffic using data mining techniques (2011-12)
Qian, Cheng; Ghosh, Joydeep
In addition to wireless network providers' need for traffic classification, the need is increasingly common in the Cloud Computing environment. A data center hosting Cloud Computing services needs to apply priority policies and Service Level Agreement (SLA) rules at the edge of its network. Growing requirements for user privacy protection and the trend of IPv6 adoption will contribute to significant growth in encrypted Cloud Computing traffic. This report presents experiments focusing on the application of data-mining-based Internet traffic classification methods to classify encrypted Cloud Computing service traffic. By combining TCP session level attributes, client and host connection patterns, and Cloud Computing service Message Exchange Patterns (MEP), the best method identified in this report yields 89% overall accuracy.
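For readers unfamiliar with this style of pipeline, the sketch below shows in rough outline how session-level features can be fed to an off-the-shelf classifier and scored for overall accuracy. It is an illustration only: the feature names and synthetic data are hypothetical, and it is not the method or code from the report above.

```python
# Illustrative sketch only: classify traffic flows from session-level features.
# Feature names and data are hypothetical, not taken from the report above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical per-session features: packet count, mean packet size,
# duration, up/down byte ratio, and inter-arrival-time variance.
X = rng.random((1000, 5))
y = rng.integers(0, 3, size=1000)  # three hypothetical service classes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("overall accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```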
Item: Clinically interpretable models for healthcare data (2015-12)
Ho, Joyce Carmen; Ghosh, Joydeep; Vishwanath, Sriram; Vikalo, Haris; Sanghavi, Sujay; Sun, Jimeng
The increasing availability of electronic health records (EHRs) has spurred the adoption of data-driven approaches to provide additional insights for diagnoses, prognoses, and cost-effective patient treatment and management. The records are composed of a diverse array of data that includes both structured information (e.g., diagnoses, medications, and lab results) and unstructured clinical narrative notes (e.g., physician's observations, progress notes, etc.). Thus, EHRs are a rich source of patient information. However, there are several formidable challenges with using EHRs that have limited their utility for clinical research so far. Problems include data quality; high-dimensional heterogeneous information from various sources; privacy; and interoperability across institutions. Further hampering the acceptance of data-driven models is the lack of interpretability of their results. Physicians are accustomed to reasoning based on concise clinical concepts (or phenotypes) rather than directly on high-dimensional EHR data. Unfortunately, these records do not readily map to simple phenotypes, let alone more sophisticated and multifaceted ones. This dissertation investigates the development of clinically interpretable models for EHR data using dimensionality reduction techniques. We posit that clinical concepts are representations in lower-dimensional latent spaces. Yet standard dimensionality reduction techniques alone are insufficient to derive concise and relevant medical concepts from EHR data. We explore two approaches: (1) state space models to dynamically track a patient's cardiac arrest risk, and (2) non-negative matrix and tensor factorization models to generate concise and clinically relevant phenotypes. Our approaches yield clinically interpretable models with minimal human intervention and provide a powerful, data-driven framework for transforming high-dimensional EHR data into medical concepts.

Item: Computational discovery of genetic targets and interactions : applications to lung cancer (2016-05)
Young, Jonathan Hubert; Marcotte, Edward M.; Gonzalez, Oscar; Dhillon, Inderjit; Elber, Ron; Wilke, Claus
We present new modes of computational drug discovery in each of the three key themes of target identification, mechanism, and therapy regimen design. In identifying candidate targets for therapeutic intervention, we develop novel applications of unsupervised clustering of whole-genome RNAi screening to prioritize biological systems whose inhibition differentially sensitizes diseased cells apart from a normal population. When applied to lung cancer, our approach identified protein complexes on which various tumor subtypes are especially dependent. Consequently, each complex represents a candidate drug target specifically intended for a particular patient sub-population. The cellular functions impacted by the protein complexes include splicing, translation, and protein folding. We obtained experimental validation for the predicted sensitivity of a lung adenocarcinoma cell line to Wnt inhibition. For our second theme, we focus on genetic interactions as a mechanism underlying sensitivity to target inhibition. Experimental characterization of such interactions has relied on brute-force assessment of gene pairs. To alleviate the experimental burden, our hypothesis is that functionally related genes tend to share common genetic interaction partners. We thereby examine a method that recognizes functional network clusters to generate high-confidence predictions of different types of genetic interactions across yeast, fly, and human. Our predictions are leave-one-out cross-validated on known interactions. Moreover, by using yeast as a model, we simulate the degree to which further human genetic interactions need to be screened in order to understand their distribution in biological systems. Finally, we employ yeast as a model organism to assess the feasibility of designing synergistic or antagonistic drug pairs based on genetic interactions between their targets. The hypothesis is that if the target genes of one chemical compound are close to those of a second compound in a genetic interaction network, then synergistic or antagonistic growth effects will occur. Proximity between sets in a gene network is quantified through graph metrics, and predictions of synergy and antagonism are validated against literature-curated gold standards. Ultimately, integrating knowledge of druggable targets, how gene perturbations interact with the genetic background of an individual, and the design of personalized therapeutic regimens will provide a general framework to treat a variety of diseases.

Item: Data driven analysis of fast oxide ion diffusion in solid oxide fuel cell cathodes (2015-08)
Miller, Alexander Scot; Benedek, Nicole; Yu, Guihua
The goal of this study was to determine whether atomic-scale features (related to composition and crystal structure) of perovskite and perovskite-related materials could be used to predict fast oxide ion diffusion for Solid Oxide Fuel Cell (SOFC) applications; materials that can be used as SOFC cathodes were a particular focus. One hundred twenty-six pairs of diffusion (D*) and surface exchange (k*) coefficients for a variety of materials were collected from literature sources published between 1991 and 2015. A website was created with these data for public viewing. Statistical tests revealed that diffusion measurements have significant differences at 400K, 700K, and 1000K when grouped according to material family and sample type. Models predicting diffusion rates were created from atomic-scale features at several temperatures between 400K and 1000K. Perovskite and double-perovskite models explained more than 85% of the variance in ln(D*k*) at 800K-1000K. These models explained 55%-75% of the variance at lower temperatures (400K-700K). Materials whose B-site cations had the highest electron affinities showed the fastest diffusion at all temperatures. Thus, these models suggest that using B-site cations with high electron affinities (i.e., atoms that are easily reduced) may increase fuel cell performance, even at low and intermediate temperatures.

Item: Data mining techniques for classifying RNA folding structures (2016-08)
Kim, Vince; Garg, Vijay K. (Vijay Kumar), 1963-; Gutell, Robin R.
RNA is a biological molecule that is critical for protein synthesis. Significant research has been done on folding algorithms for RNA, in particular the 16S rRNA of bacteria and archaea. Rather than modifying current work on these folding algorithms, this report ventures into pioneering work on data mining the same 16S rRNA. The initial analysis was based on a single complex helix across seven organisms. However, classification analysis proved to be inaccurate due to severe multicollinearity in the data set. A secondary data mining analysis was done on the entire RNA sequence of the same seven organisms and was successfully used to sequentially predict the categorical characteristics of a given nucleotide in the RNA sequence.
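The RNA-folding report above attributes its initial classification failures to severe multicollinearity. As a hedged illustration of how such collinearity is commonly diagnosed (not the report's actual procedure), the sketch below computes variance inflation factors for synthetic features, one of which is deliberately made nearly collinear with another.

```python
# Illustrative sketch only: diagnose multicollinearity via variance inflation factors.
# Synthetic data; not the procedure used in the report above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF_j = 1 / (1 - R^2) from regressing column j on the remaining columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

for j in range(X.shape[1]):
    print(f"feature {j}: VIF = {vif(X, j):.1f}")   # VIF >> 10 flags severe collinearity
```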
Item: Data-mining the Ubuntu Linux Distribution for bug analysis and resolution (2012-08)
Arges, Christopher John; Stewart, Kate; Ghosh, Joydeep
The Ubuntu Linux Distribution represents a massive investment of time and human effort to produce a reliable computing experience for users. To accomplish these goals, software bugs must be tracked and fixed. However, as the number of users increases and bug reports grow, advanced tools such as data mining must be used to increase the effectiveness of all contributors to the project. Thus, this report involved collecting a large number of bug reports into a database and calculating relevant statistics. Because of the diversity and quantity of bug reports, contributors must find which bugs are most relevant and important to work on. One study in this report created an automatic way to determine, using classification techniques, who is best suited to solve a particular bug. In addition, this report explores how to classify whether a bug report will eventually be marked invalid.

Item: Dataflow parallelism for large scale data mining (2010-08)
Daruru, Srivatsava; Ghosh, Joydeep; Marin, Nena
The unprecedented and exponential growth of data, along with the advent of multi-core processors, has triggered a massive paradigm shift from traditional single-threaded programming to parallel programming. A number of parallel programming paradigms have thus been proposed and have become pervasive and inseparable from any large production environment. Also, with the massive amounts of data available and the ever-increasing business need to process and analyze this data quickly at minimum cost, there is much more demand for implementing fast data mining algorithms on cheap hardware. This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through a graph of operators. This paradigm exhibits pipeline, horizontal, and vertical parallelism and requires only the data of the active operators in memory at any given time, allowing it to scale easily to very large datasets. The thesis describes the dataflow implementation of two data mining applications on huge datasets. We first develop an efficient dataflow implementation of a Collaborative Filtering (CF) algorithm based on weighted co-clustering and test its effectiveness on the large and sparse Netflix data. This implementation of the recommender system was able to rapidly train and predict over 100 million ratings within 17 minutes on a commodity multi-core machine. We then describe a dataflow implementation of a non-parametric, density-based clustering algorithm called Auto-HDS to automatically detect small and dense clusters in a massive astronomy dataset. This implementation was able to discover dense clusters at varying density thresholds and generate a compact cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability to scale to millions of points as we increase the number of available resources. Our experimental results illustrate the ability of this model to "scale" well to massive datasets and its ability to rapidly discover useful patterns in two different applications.

Item: Distributed learning using generative models (2006)
Merugu, Srujana; Ghosh, Joydeep

Item: Essays on inflation forecast based rules, robust policies and sovereign debt (2004)
Rodriguez, Arnulfo; Kendrick, David A.
The success of inflation reduction in industrial countries, along with the adoption of inflation targeting regimes by many central banks, has prompted considerable interest in "feedback rules" for inflation targeting. Over the past few years, much research has been devoted to assessing the performance of these rules. The first essay provides the optimal policy responsiveness of inflation forecast based (IFB) rules to inflation and/or output shocks needed to lead the inflation and output state variables back to their equilibrium values. The rendered system of equations is a bilinear one that becomes the restriction for the quadratic criterion function used in control theory problems. There has been recent interest in the use of robust control techniques for economic policies. Analyzing a control variable's response as the degree of robustness rises is important in advancing our understanding of the application of robust control methods to economic models. The second and third essays provide an analytical framework derived from a one-state, one-control robust control problem in order to understand the relationship between the control variable and unstructured model uncertainty. Seeking a robust policy rule for a variety of different structural macroeconomic models is an important exercise to determine if an IFB rule would be adequate to meet some performance criteria in the face of model uncertainty. Robust performance is all about finding a rule that makes it possible to have some similar, if not equal, performance across different models. However, before a rule becomes robust performance-wise, it must be robust stability-wise. The fourth essay provides a way of searching for IFB rules that are robust to two different structural macroeconomic models. First, we find the IFB rules that accomplish robust stability by using the root-locus method. Second, we find a subset of such rules that accomplish similar performances across both models, i.e., robust rules. Finally, the fifth essay uses and compares several data mining techniques to understand and predict sovereign debt rescheduling. Receiver Operating Characteristic curves are used to measure the discrimination power of the models. Moreover, the issue of interpretability of models for sovereign debt rescheduling is addressed.
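The fifth essay above uses Receiver Operating Characteristic curves to measure the discrimination power of rescheduling models. The minimal sketch below illustrates that style of evaluation on synthetic data with scikit-learn; the models and data are placeholders, not those used in the dissertation.

```python
# Illustrative sketch only: ROC/AUC comparison of two classifiers on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Imbalanced binary problem standing in for "reschedule vs. no reschedule".
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(max_depth=4, random_state=0))]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, scores):.3f}")  # area under the ROC curve
```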
Item: Eukaryotic transcriptional regulation : from data mining to transcriptional profiling (2008-12)
Morgan, Xochitl Chamorro; Iyer, Vishwanath R.
Survival of cells and organisms requires that each of thousands of genes is expressed at the correct time in development, in the correct tissue, and under the correct conditions. Transcription is the primary point of gene regulation. Genes are activated and repressed by transcription factors, which are proteins that become active through signaling, bind (sometimes cooperatively) to regulatory regions of DNA, and interact with other proteins such as chromatin remodelers. Yeast has nearly six thousand genes, several hundred of which are transcription factors; transcription factors comprise around 2,000 of the 22,000 genes in the human genome. When and how these transcription factors are activated, as well as which subsets of genes they regulate, is a current, active area of research essential to understanding the transcriptional regulatory programs of organisms. We approached this problem in two divergent ways: first, an in silico study of human transcription factor combinations, and second, an experimental study of the transcriptional response of yeast mutants deficient in DNA repair. First, in order to better understand the combinatorial nature of transcription factor binding, we developed a data mining approach to assess whether transcription factors whose binding motifs were frequently proximal in the human genome were more likely to interact. We found many instances in the literature in which over-represented transcription factor pairs co-regulated the same gene, so we used co-citation to assess the utility of this method on a larger scale. We determined that over-represented pairs were more likely to be co-cited than would be expected by chance. Because proper repair of DNA is an essential and highly conserved process in all eukaryotes, we next used cDNA microarrays to measure differentially expressed genes in eighteen yeast deletion strains with sensitivity to the DNA-damaging agent methyl methanesulfonate (MMS); many of these mutants were transcription factors or DNA-binding proteins. Combining these data with tools such as chromatin immunoprecipitation, gene ontology analysis, expression profile similarity, and motif analysis allowed us to propose a model for the roles of Iki3 and of YML081W, a poorly characterized gene, in DNA repair.

Item: An exploratory study of teacher retention using data mining (2014-05)
Krause, Gladys Helena; Marshall, Jill Ann; Carmona Domínguez, Guadalupe de la Paz
The object of this investigation is to report a study of mathematics teacher retention in the Texas Education System by generating a model that allows the identification of crucial factors associated with teacher retention in the profession. This study answers the research question: given a new mathematics teacher with little or no service in the Texas Education System, how long might one expect her to remain in the system? The basic categories used in this study to describe teacher retention are: long term (10 or more years of service), medium term (5 to 9 years of service), and short term (1 to 4 years of service). The research question is addressed by generating a model, through data mining techniques and using teacher data and variables from the Texas Public Education Information Management System (PEIMS), that allows a descriptive identification of the factors that are crucial in teacher retention. Research on mathematics teacher turnover in Texas has not yet focused on teacher characteristics. The literature review presented in this investigation shows that teacher characteristics are important in studying factors that may influence teachers' decisions to stay or to leave the system. This study presents the field of education, and the state of Texas, with an opportunity to isolate the crucial factors that keep mathematics teachers from leaving the teaching profession, which has the potential to inform policy makers and other educators when making decisions that could have an impact on teacher retention. Also, the methodology applied, data mining, allows this study to take full advantage of a collection of valuable resources provided by the Texas Education Agency (TEA) through the Public Education Information Management System (PEIMS), which has not yet been used to study the phenomenon of teacher retention.

Item: The fluviageny, a method for analyzing temporal river fragmentation using phylogenetics (2015-05)
Gordon, Andrew Lloyd; Howison, James; Arctur, David K.
Phylogenetic trees have historically been used to determine evolutionary relatedness between organisms. In the past few decades, as increasingly powerful computational algorithms and toolsets for phylogenetic analysis have been developed, the use of these trees has expanded into other areas, including biodiversity informatics and geoinformatics. This report proposes using phylogenetic methods to create "fluviagenies": trees that represent the effects of river fragmentation over time caused by damming. Faculty at the Center for Research in Water Resources at the University of Texas worked to develop tools and documentation for automating the creation of river segment codes (a.k.a. "fluvcodes") based on spatiotemporal data. Python was used to generate fluviageny trees from lists of these codes. The resulting trees can be exported into the appropriate data format for use with various phylogenetics programs. The Fishes of Texas Database (fishesoftexas.org), a comprehensive geospatial database of Texas fish occurrences aggregated and normalized from 42 museum collections around the world, was employed to create an example of how this tool might be used to analyze and hypothesize changes in fish populations as a consequence of river fragmentation. Additionally, this paper theorizes and analyzes past and potential future uses for phylogenetic trees in various other fields of informatics.
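The report above generates fluviageny trees in Python and exports them for use with standard phylogenetics programs. The sketch below is a toy illustration of that idea only: it renders a hypothetical nesting of river segments, split by dams, as a Newick string that most phylogenetics tools can read. The segment names and nesting are invented and do not follow the actual fluvcode conventions.

```python
# Illustrative sketch only: emit a Newick tree for river segments split by dams.
# The nesting structure and segment names are hypothetical, not actual fluvcodes.

def to_newick(node):
    """Recursively render a nested (name, children) structure as a Newick subtree."""
    name, children = node
    if not children:
        return name
    return "(" + ",".join(to_newick(c) for c in children) + ")" + name

# A river fragmented twice: the original reach R splits at dam D1,
# and the downstream piece splits again at dam D2 (here labeled by the split node D1).
river = ("R", [("R_up", []),
               ("D1", [("R_mid", []), ("R_down", [])])])

print(to_newick(river) + ";")   # prints (R_up,(R_mid,R_down)D1)R;
```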
Item: Integrating data mining and transient modeling for anomaly detection and condition assessment in water distribution systems (2021-07-21)
Xing, Lu, Ph. D.; Sela, Lina; Caramanis, Constantine; Passalacqua, Paola; Nagy, Zoltan
The need for more resilient, sustainable, and intelligent water distribution systems (WDSs) is becoming increasingly urgent due to the challenges imposed by rapid urbanization, depleting water resources, infrastructure degradation, and climate change. However, the performance of water systems is often difficult to monitor and model because of their size, complexity, underground location, and the large amount of data needed to fully grasp how these systems function. Hydraulic transients in WDSs can disturb steady-state flow conditions by introducing fast flow changes, imposing abrupt internal pressure forces on pipeline systems, and generating pressure waves that propagate rapidly through piped networks. These disturbances have been identified as one of the major contributing factors to pipe deterioration and catastrophic failure. Complementarily, transient-based approaches are promising for fault detection, condition assessment, and pressure management, because a significant amount of information about the water system can be revealed during the very short period in which the transient wave travels through the piped network. The goal of this dissertation is to leverage transient modeling, data mining, machine learning, and remote sensing to develop physics-based and data-driven models that achieve high-performance modeling and improve monitoring capabilities in WDSs. In the context of system monitoring, a two-step time series data mining approach is proposed to extract patterns, intensity, and frequency from high-resolution pressure data. The proposed approach provides a fast and efficient way to discover hidden information in WDSs by analyzing high-resolution pressure signals from distributed sensors. From the perspective of physics-based modeling, this dissertation contributes the first open-source Python package, TSNet, for transient modeling in WDSs. TSNet can simulate various transient conditions, including background leaks, pipe bursts, and operational changes in valves and pumps. Moreover, this dissertation proposes an approach for utilizing transient modeling to design distributed sensor networks such that robust network-wide monitoring can be achieved under data and model uncertainties. The proposed sensor placement strategy addresses the challenge of a limited budget, as well as data and modeling uncertainties, by incorporating robust representation and tolerance analysis into an optimization framework whose objective is to achieve the best detection and identification of all possible events. Finally, this dissertation proposes graph neural network (GNN) models as a learning-based approach for data assimilation and state estimation in WDSs. In addition to the classical supervised training approach, a semi-supervised training scheme is also formulated to provide a novel paradigm in which physical laws and measurement data can be incorporated to improve the training of GNN models. The results demonstrate that the GNN model trained with the semi-supervised scheme is promising for data assimilation and state estimation in WDSs.

Item: Integrating top-down and bottom-up approaches in inductive logic programming: applications in natural language processing and relational data mining (2003)
Tang, Lap Poon Rupert; Mooney, Raymond J. (Raymond Joseph)
Inductive Logic Programming (ILP) is the intersection of Machine Learning and Logic Programming, in which the learner's hypothesis space is the set of logic programs. There are two major ILP approaches: top-down and bottom-up. The former searches the hypothesis space from general to specific, while the latter searches in the other direction. Integrating both approaches has been demonstrated to be more effective. Integrated ILP systems were previously developed for two tasks: learning semantic parsers (Chillin), and mining relational data (Progol). Two new integrated ILP systems for these tasks that overcome limitations of existing methods will be presented. Cocktail is a new ILP algorithm for inducing semantic parsers. For this task, two features of a parse state, functional structure and context, provide important information for disambiguation. A bottom-up approach is more suitable for learning the former, while top-down is better for the latter. By allowing both approaches to induce program clauses and choosing the best combination of their results, Cocktail learns more effective parsers. Experimental results on learning natural-language interfaces for two databases demonstrate that it learns more accurate parsers than Chillin, the previous best method for this task. Beth is a new integrated ILP algorithm for relational data mining. The Inverse Entailment approach to ILP, implemented in the Progol and Aleph systems, starts with the construction of a bottom clause, the most specific hypothesis covering a seed example. When mining relational data with a large number of background facts, the bottom clause becomes intractably large, making learning very inefficient. A top-down approach heuristically guides the construction of clauses without building a bottom clause; however, it wastes time exploring clauses that cover no positive examples. By using a top-down approach to heuristically guide the construction of generalizations of a bottom clause, Beth combines the strengths of both approaches. Learning patterns for detecting potential terrorist activity is a current challenge problem for relational data mining. Experimental results on artificial data for this task, with over half a million facts, show that Beth is significantly more efficient at discovering such patterns than Aleph and m-Foil, two leading ILP systems.

Item: Learning from aggregated data (2019-02-11)
Bhowmik, Avradeep; Ghosh, Joydeep; Vikalo, Haris; Sanghavi, Sujay; Dimakis, Georgios-Alexandros; Koyejo, Oluwasanmi
Data aggregation is ubiquitous in modern life. For various reasons, such as privacy, scalability, and robustness, ground-truth data is often subjected to aggregation before being released to the public or utilised by researchers and analysts. Learning from aggregated data is a challenging problem that requires significant algorithmic innovation, since naive application of standard techniques to aggregated data is vulnerable to the ecological fallacy. In this work, we explore three different versions of this setting. First, we tackle the problem of using generalised linear models when features/covariates are fully observed but the targets are only available as histograms, a common scenario in the healthcare domain, where many datasets contain both non-sensitive attributes, like age, sex, and zip code, and privacy-sensitive attributes, like healthcare records. We introduce an efficient algorithm that uses alternating data imputation and GLM estimation steps to learn predictive models in this setting. Next, we look at the problem of learning sparse linear models when both features and targets are in aggregated form, specified as empirical estimates of group-wise means computed over different sub-groups of the population. We show that if the true sub-populations are heterogeneous enough, the optimal sparse parameter can be recovered within an arbitrarily small tolerance even in the presence of noise, provided the empirical estimates are obtained from a sufficiently large number of observations. Third, we tackle the scenario of predictive modelling with data that is subjected to spatio-temporal aggregation. We show that by formulating the problem in the frequency domain, we can bypass the mathematical and representational challenges that arise due to non-uniform aggregation, misaligned sampling periods, and aliasing. We introduce a novel algorithm that uses restricted Fourier transforms to estimate a linear model which, when applied to spatio-temporally aggregated data, has a generalisation error that is provably close to the optimal performance of the best possible linear model that can be learned from the non-aggregated data set. We then focus our attention on the complementary problem, which involves designing aggregation strategies that can allow learning, as well as developing algorithmic techniques that can use only the aggregates to train a model that works on individual samples. We motivate our methods using the example of Gaussian regression, and subsequently extend our techniques to subsume binary classifiers and generalised linear models. We demonstrate the effectiveness of our techniques with empirical evaluation on data from healthcare and telecommunications. Finally, we present a concrete example of our methods applied to a real-life practical problem. Specifically, we consider an application in the domain of online advertising, where the complexity of bidding strategies requires accurate estimates of the most probable cost-per-click (CPC) incurred by advertisers, but the data used for training these CPC prediction models are only available as aggregated invoices supplied by an ad publisher on a daily or hourly basis. We introduce a novel learning framework that can use aggregates computed at varying levels of granularity to build individual-level predictive models. We generalise our modelling and algorithmic framework to handle data from diverse domains, and extend our techniques to cover arbitrary aggregation paradigms such as sliding windows and overlapping/non-uniform aggregation. We show empirical evidence for the efficacy of our techniques with experiments on both synthetic data and real data from the online advertising domain, as well as healthcare, to demonstrate the wider applicability of our framework.
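The dissertation above describes an algorithm that alternates data imputation and GLM estimation steps when targets are available only in aggregated form. The sketch below is a simplified stand-in for that idea, assuming group-level means rather than histograms and plain linear regression rather than a general GLM; it is not the dissertation's algorithm, and all names and data are synthetic.

```python
# Illustrative sketch only: alternate between imputing individual targets and
# refitting a linear model when only group-level means of the target are observed.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_groups, per_group, d = 40, 50, 5
mus = rng.normal(size=(n_groups, d))                 # each group has its own feature profile
groups = np.repeat(np.arange(n_groups), per_group)
X = mus[groups] + 0.3 * rng.normal(size=(len(groups), d))
w_true = rng.normal(size=d)
y_true = X @ w_true + 0.1 * rng.normal(size=len(groups))
group_means = np.array([y_true[groups == g].mean() for g in range(n_groups)])

y_imp = group_means[groups].copy()                   # start from each sample's group aggregate
model = LinearRegression()
for _ in range(20):
    model.fit(X, y_imp)                              # estimation step
    y_hat = model.predict(X)
    for g in range(n_groups):                        # imputation step: shift predictions so
        idx = groups == g                            # each group matches its observed mean
        y_imp[idx] = y_hat[idx] - y_hat[idx].mean() + group_means[g]

print("parameter recovery error:", np.linalg.norm(model.coef_ - w_true))
```

Because the groups have distinct feature profiles, the group means carry information about the regression weights, and the alternating scheme gradually corrects the initial, attenuated fit.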
Item: Matrix nearness problems in data mining (2007)
Sra, Suvrit, 1976-; Dhillon, Inderjit S.
This thesis addresses some fundamental problems in data mining and machine learning that may be cast as matrix nearness problems. Some examples of well-known nearness problems are: low-rank approximations, sparse approximations, clustering, co-clustering, kernel learning, and independent component analysis. In this thesis we study two types of matrix nearness problems. In the first type, we compute a low-rank matrix approximation to a given input matrix, thereby representing it more efficiently and hopefully discovering the latent structure within the input data. In the second kind of nearness problem, we seek either to learn a parameterized model of/from the input data, or, when the data represent noisy measurements of some underlying objects, to recover the original measurements. Both types of problems can be naturally approached by computing an output model/matrix that is "near" the input. The specific nearness problems that we study in this thesis include: (i) nonnegative matrix approximation (NNMA), (ii) incremental low-rank matrix approximations, (iii) general low-rank matrix approximations via convex optimization, (iv) learning a parametric mixture model for data, specifically for directional data, and (v) metric nearness. NNMA is a recent powerful matrix decomposition technique that approximates a nonnegative input matrix by a low-rank approximation composed of nonnegative factors. It has found wide applicability across a broad spectrum of fields, ranging from problems in text analysis, image processing, and gene microarray analysis, to music transcription. We develop several new generalizations of the NNMA problem and derive efficient iterative algorithms for computing the associated approximations. Furthermore, we also provide efficient software which implements many of the derived algorithms. With growing input matrix sizes, low-rank approximation techniques can themselves become computationally expensive. For such situations, and to aid model selection (the rank of the approximation), we develop incremental versions of low-rank matrix approximations, where the approximation is obtained one rank at a time. There are several applications of such a scheme, for example, topic discovery from a collection of documents. We also develop methods based on large-scale convex optimization for computing low-rank approximations to the input data. Our approach can deal with large-scale data while permitting the incorporation of constraints more general than nonnegativity if desired. Our approach has some beneficial byproducts: it yields new methods for solving the nonnegative least squares problem, as well as l1-norm regression. The next nearness problem that we look at is that of learning a parametric probabilistic mixture model for the data. Here one estimates a parameter matrix given the input data, where the estimation process is implicitly regularized to avoid over-fitting. In particular, we solve the parameter estimation problem for two fundamental high-dimensional directional distributions, namely the von Mises-Fisher and Watson distributions. Parameter estimation for these distributions is highly non-trivial, and we present efficient methods for it. The final nearness problem that we study is a more typical matrix nearness problem, called metric nearness. The goal here is to find a distance matrix (i.e., a matrix whose entries satisfy the triangle inequality) that is "nearest" to an input matrix of dissimilarity values. For most of the algorithms that we develop in this thesis, we also provide software that implements them. This software will be of use to both researchers and practitioners who want to experiment with our algorithms.

Item: Mining statistical correlations with applications to software analysis (2008-08)
Davis, Jason Victor; Dhillon, Inderjit S.; Witchel, Emmett
Machine learning, data mining, and statistical methods work by representing real-world objects in terms of feature sets that best describe them. This thesis addresses problems related to inferring and analyzing correlations among such features. The contributions of this thesis are two-fold: we develop formulations and algorithms for addressing correlation mining problems, and we also provide novel applications of our methods to statistical software analysis domains. We consider problems related to analyzing correlations via unsupervised approaches, as well as algorithms that infer correlations using fully supervised or semi-supervised information. In the context of correlation analysis, we propose the problem of correlation matrix clustering, which employs a k-means-style algorithm to group sets of correlations in an unsupervised manner. Fundamental to this algorithm is a measure for comparing correlations called the log-determinant (LogDet) divergence, and a primary contribution of this thesis is the interpretation and analysis of this measure in the context of information theory and statistics. Additionally, based on the LogDet divergence, we present a metric learning problem called Information-Theoretic Metric Learning, which uses semi-supervised or fully supervised data to infer correlations for the parametrization of a Mahalanobis distance metric. We also consider the problem of learning Mahalanobis correlation matrices in the presence of high dimensions, when the number of pairwise correlations can grow very large. In validating our correlation mining methods, we consider two in-depth, real-world statistical software analysis problems: software error reporting and unit test prioritization. In the context of Clarify, we investigate two types of correlation mining applications: metric learning for nearest neighbor software support, and decision trees for error classification. We show that our metric learning algorithms can learn program-specific similarity models for more accurate nearest neighbor comparisons. In the context of decision tree learning, we address the problem of learning correlations with associated feature costs, in particular the overhead costs of software instrumentation. As our second application, we present a unit test ordering algorithm which uses clustering and nearest neighbor algorithms, along with a metric learning component, to efficiently search and execute large unit test suites.
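The LogDet divergence that anchors the metric-learning work above has a standard closed form for positive-definite matrices A and B: D_ld(A, B) = tr(AB^{-1}) - log det(AB^{-1}) - n. The snippet below computes that standard formula with NumPy as a quick reference; it is not code from the thesis, and the test matrices are synthetic.

```python
# Standard LogDet (Burg) matrix divergence between positive-definite matrices:
# D_ld(A, B) = tr(A B^{-1}) - log det(A B^{-1}) - n.  Quick reference, not thesis code.
import numpy as np

def logdet_divergence(A, B):
    n = A.shape[0]
    M = A @ np.linalg.inv(B)
    _, logdet = np.linalg.slogdet(M)   # log |det(M)|; det is positive for PD inputs
    return np.trace(M) - logdet - n

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 4))
A = G @ G.T + 4 * np.eye(4)            # two synthetic positive-definite matrices
H = rng.normal(size=(4, 4))
B = H @ H.T + 4 * np.eye(4)

print(logdet_divergence(A, A))         # ~0: divergence of a matrix from itself
print(logdet_divergence(A, B))         # nonnegative, and asymmetric in general
```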