Overlapping community detection in massive social networks
MetadataShow full item record
Massive social networks have become increasingly popular in recent years. Community detection is one of the most important techniques for the analysis of such complex networks. A community is a set of cohesive vertices that has more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this thesis, we propose scalable overlapping community detection algorithms that effectively identify high quality overlapping communities in various real-world networks. We first develop an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of this algorithm is to find good seeds and then greedily expand these seeds using a personalized PageRank clustering scheme. Experimental results show that our algorithm significantly outperforms other state-of-the-art overlapping community detection methods in terms of run time, cohesiveness of communities, and ground-truth accuracy. To develop more principled methods, we formulate the overlapping community detection problem as a non-exhaustive, overlapping graph clustering problem where clusters are allowed to overlap with each other, and some nodes are allowed to be outside of any cluster. To tackle this non-exhaustive, overlapping clustering problem, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. To optimize the objective, we develop not only fast iterative algorithms but also more sophisticated algorithms using a low-rank semidefinite programming technique. Our experimental results show that the new objective and the algorithms are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness. We extend our non-exhaustive, overlapping clustering techniques to co-clustering where the goal is to simultaneously identify a clustering of the rows as well as the columns of a data matrix. As an example application, consider recommender systems where users have ratings on items. This can be represented by a bipartite graph where users and items are denoted by two different types of nodes, and the ratings are denoted by weighted edges between the users and the items. In this case, co-clustering would be a simultaneous clustering of users and items. We propose a new co-clustering objective function and an efficient co-clustering algorithm that is able to identify overlapping clusters as well as outliers on both types of the nodes in the bipartite graph. We show that our co-clustering algorithm is able to effectively capture the underlying co-clustering structure of the data, which results in boosting the performance of a standard one-dimensional clustering. Finally, we study the design of parallel data-driven algorithms, which enables us to further increase the scalability of our overlapping community detection algorithms. Using PageRank as a model problem, we look at three algorithm design axes: work activation, data access pattern, and scheduling. We investigate the impact of different algorithm design choices. Using these design axes, we design and test a variety of PageRank implementations finding that data-driven, push-based algorithms are able to achieve a significantly superior scalability than standard PageRank implementations. The design choices affect both single-threaded performance as well as parallel scalability. The lessons learned from this study not only guide efficient implementations of many graph mining algorithms but also provide a framework for designing new scalable algorithms, especially for large-scale community detection.