Dataflow parallelism for large scale data mining

Daruru, Srivatsava

dc.contributor.advisor	Ghosh, Joydeep	en
dc.contributor.committeeMember	Marin, Nena	en
dc.creator	Daruru, Srivatsava	en
dc.date.accessioned	2010-12-20T20:40:40Z	en
dc.date.available	2010-12-20T20:40:40Z	en
dc.date.available	2010-12-20T20:40:46Z	en
dc.date.issued	2010-08	en
dc.date.submitted	August 2010	en
dc.date.updated	2010-12-20T20:40:46Z	en
dc.description	text	en
dc.description.abstract	The unprecedented, exponential growth of data, along with the advent of multi-core processors, has triggered a paradigm shift from traditional single-threaded programming to parallel programming. A number of parallel programming paradigms have been proposed and have become pervasive in large production environments. With massive amounts of data available and an ever-increasing business need to process and analyze this data quickly at minimum cost, there is growing demand for fast data mining algorithms on cheap hardware. This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through a graph of operators. This paradigm exhibits pipeline, horizontal, and vertical parallelism and requires only the data of the active operators to be in memory at any given time, allowing it to scale easily to very large datasets. The thesis describes dataflow implementations of two data mining applications on very large datasets. We first develop an efficient dataflow implementation of a Collaborative Filtering (CF) algorithm based on weighted co-clustering and test its effectiveness on the large, sparse Netflix dataset. This implementation of the recommender system was able to train and predict over 100 million ratings within 17 minutes on a commodity multi-core machine. We then describe a dataflow implementation of a non-parametric, density-based clustering algorithm called Auto-HDS to automatically detect small, dense clusters in a massive astronomy dataset. This implementation was able to discover dense clusters at varying density thresholds and generate a compact cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability to scale to millions of points as we increase the number of available resources. 
Our experimental results illustrate the ability of this model to “scale” well to massive datasets and its ability to rapidly discover useful patterns in two different applications.	en
dc.description.department	Computer Sciences	en
dc.format.mimetype	application/pdf	en
dc.identifier.uri	http://hdl.handle.net/2152/ETD-UT-2010-08-1838	en
dc.language.iso	eng	en
dc.subject	Dataflow processing	en
dc.subject	Data mining	en
dc.subject	Distributed computing	en
dc.subject	Large scale data mining	en
dc.subject	Parallel processing	en
dc.title	Dataflow parallelism for large scale data mining	en
dc.type.genre	thesis	en
thesis.degree.department	Computer Sciences	en
thesis.degree.discipline	Computer Sciences	en
thesis.degree.grantor	University of Texas at Austin	en
thesis.degree.level	Masters	en
thesis.degree.name	Master of Science in Computer Sciences	en

Access full-text files

Original bundle

Name: DARURU-THESIS.pdf
Size: 1.05 MB
Format: Adobe Portable Document Format

Collections

UT Electronic Theses and Dissertations