Dataflow parallelism for large scale data mining

dc.contributor.advisor Ghosh, Joydeep
dc.creator Daruru, Srivatsava
dc.date.accessioned 2010-12-20T20:40:40Z
dc.date.accessioned 2010-12-20T20:40:46Z
dc.date.available 2010-12-20T20:40:40Z
dc.date.available 2010-12-20T20:40:46Z
dc.date.created 2010-08
dc.date.issued 2010-12-20
dc.date.submitted August 2010
dc.identifier.uri http://hdl.handle.net/2152/ETD-UT-2010-08-1838
dc.description.abstract The unprecedented and exponential growth of data, along with the advent of multi-core processors, has triggered a massive paradigm shift from traditional single-threaded programming to parallel programming. A number of parallel programming paradigms have thus been proposed and have become pervasive and inseparable from any large production environment. Also, with the massive amounts of data available and the ever-increasing business need to process and analyze this data quickly at minimum cost, there is much more demand for implementing fast data mining algorithms on cheap hardware. This thesis explores a parallel programming model called dataflow, the essence of which is computation organized by the flow of data through a graph of operators. This paradigm exhibits pipeline, horizontal and vertical parallelism and requires only the data of the active operators to be in memory at any given time, allowing it to scale easily to very large datasets. The thesis describes the dataflow implementation of two data mining applications on huge datasets. We first develop an efficient dataflow implementation of a Collaborative Filtering (CF) algorithm based on weighted co-clustering and test its effectiveness on the large and sparse Netflix dataset. This implementation of the recommender system was able to rapidly train and predict over 100 million ratings within 17 minutes on a commodity multi-core machine. We then describe a dataflow implementation of a non-parametric, density-based clustering algorithm called Auto-HDS to automatically detect small and dense clusters in a massive astronomy dataset. This implementation was able to discover dense clusters at varying density thresholds and generate a compact cluster hierarchy on 100k points in less than 1.3 hours. We also show its ability to scale to millions of points as we increase the number of available resources.
Our experimental results illustrate the ability of this model to “scale” well to massive datasets and its ability to rapidly discover useful patterns in two different applications.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.subject Dataflow processing
dc.subject Data mining
dc.subject Distributed computing
dc.subject Large scale data mining
dc.subject Parallel processing
dc.title Dataflow parallelism for large scale data mining
dc.date.updated 2010-12-20T20:40:46Z
dc.contributor.committeeMember Marin, Nena
dc.description.department Computer Sciences
dc.type.genre thesis
dc.type.material text
thesis.degree.department Computer Sciences
thesis.degree.discipline Computer Sciences
thesis.degree.grantor University of Texas at Austin
thesis.degree.level Masters
thesis.degree.name Master of Science in Computer Sciences

Files in this work

Download File: DARURU-THESIS.pdf
Size: 1.104Mb
Format: application/pdf
