Supervised Community Detection in Protein-interaction Networks
Community detection problems arise in several fields with networks, from biology and medicine to social studies and cybersecurity. Networks in these fields tend to be massive - for instance hu.MAP, the human protein interaction network assembled by our lab has 17 million edges. The problem of community detection here translates to finding protein complexes, which will help advance our understanding of several cellular functions and disease mechanisms. To solve this computationally challenging big data problem, we developed Super.Complex (short for Supervised Complex), a computational pipeline employing auto-ML and subgraph sampling techniques. While most state-of-the-art algorithms employ unsupervised graph clustering methods, a supervised approach holds more promise towards finding accurate communities mimicking the real world. With data on known communities becoming increasingly available in many applications, supervised methods become more relevant. Super.Complex implements a streamlined algorithm which samples subgraphs from the weighted network and classifies them as communities or non-communities via a supervised ML model. The steps involved are (i) sampling non-community data as random walks on the graph (ii) feature extraction and selection for known communities and generated non-communities, (iii) autoML pipeline for identification and training of thebest supervised machine learning model for binary classification of subgraphs (iv) intelligent sampling of candidate subgraphs for classification via 3 search techniques – greedy, iterative simulated annealing and metropolis. The last step is in fact a solution to the NP hard problem of identifying maximally scoring subgraphs in a network. The algorithm is applied to real data of different human and yeast protein interaction networks, yielding F1 scores ranging from 0.96 to 0.99 and identifying previously unknown biological complexes. Further, Super.Complex outperforms many state-of-the-art algorithms both in terms of accuracy and performance, with scalability to huge networks through its distributed framework.