Supervised Community Detection in Protein-interaction Networks

dc.creatorPalukuri, Meghana
dc.creatorMarcotte, Edward
dc.date.accessioned2020-01-15T21:46:40Z
dc.date.available2020-01-15T21:46:40Z
dc.date.issued2019
dc.description.abstractCommunity detection problems arise in several fields with networks, from biology and medicine to social studies and cybersecurity. Networks in these fields tend to be massive - for instance hu.MAP, the human protein interaction network assembled by our lab has 17 million edges. The problem of community detection here translates to finding protein complexes, which will help advance our understanding of several cellular functions and disease mechanisms. To solve this computationally challenging big data problem, we developed Super.Complex (short for Supervised Complex), a computational pipeline employing auto-ML and subgraph sampling techniques. While most state-of-the-art algorithms employ unsupervised graph clustering methods, a supervised approach holds more promise towards finding accurate communities mimicking the real world. With data on known communities becoming increasingly available in many applications, supervised methods become more relevant. Super.Complex implements a streamlined algorithm which samples subgraphs from the weighted network and classifies them as communities or non-communities via a supervised ML model. The steps involved are (i) sampling non-community data as random walks on the graph (ii) feature extraction and selection for known communities and generated non-communities, (iii) autoML pipeline for identification and training of thebest supervised machine learning model for binary classification of subgraphs (iv) intelligent sampling of candidate subgraphs for classification via 3 search techniques – greedy, iterative simulated annealing and metropolis. The last step is in fact a solution to the NP hard problem of identifying maximally scoring subgraphs in a network. The algorithm is applied to real data of different human and yeast protein interaction networks, yielding F1 scores ranging from 0.96 to 0.99 and identifying previously unknown biological complexes. Further, Super.Complex outperforms many state-of-the-art algorithms both in terms of accuracy and performance, with scalability to huge networks through its distributed framework.
dc.description.departmentTexas Advanced Computing Center (TACC)
dc.identifier.urihttps://hdl.handle.net/2152/79826
dc.identifier.urihttp://dx.doi.org/10.26153/tsw/6852
dc.language.isoeng
dc.relation.ispartofTACCSTER 2019 Proceedings
dc.relation.ispartofseriesTACCSTER Proceedings 2019
dc.rights.restrictionOpen
dc.subjectnetworks
dc.subjectcommunity detection
dc.titleSupervised Community Detection in Protein-interaction Networks
dc.typePoster
Files
Original bundle
Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Palukuri_Poster.pdf
Size:
1.49 MB
Format:
Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
No Thumbnail Available
Name:
Palukuri_SLA.pdf
Size:
361.12 KB
Format:
Adobe Portable Document Format
Description:
TSW License Agreement