Machine learning methods for community detection in networks using known community information

Abstract

In a network, the problem of community detection refers to finding groups of nodes and edges that form ‘communities’ relevant to the field, such as groups of people with common interests in social networks and fraudulent websites linked to each other on the web. Community detection also yields downstream use-cases such as the summarization of massive networks into smaller networks of communities. We are most interested in mining protein complexes, i.e., communities of interacting proteins, accelerating biological experiments by providing candidates for previously unknown protein complexes. Characterization of protein complexes is important, as they play essential roles in cellular functions and their disruption often leads to disease. Previous methods in community detection comprise a majority of unsupervised graph clustering strategies, which work on the assumption that communities are dense subgraphs in a network - which is not always true. Also, many community detection algorithms are in-memory and serial and do not scale to large networks. In this dissertation, we use knowledge from communities, including rich features from graph nodes, with supervised and reinforcement learning, improving on accuracies, with parallel algorithms ensuring high performance and scalability. Specifically, we work on (1) learning a community fitness function using supervised machine learning methods with AutoML; (2) a distributed algorithm for finding candidate communities using multiple heuristics; (3) learning to walk trajectories on a network leading to communities with reinforcement learning and (4) feature augmentation with graph node information, such as images and additional graph node embeddings. While we optimize our algorithms on protein complexes that have characteristics such as being overlapping in nature with different topologies, our methods are generalizable to other domains since they learn and use characteristics of communities to predict new communities. Further, in domains with limited known information, the algorithms we develop can be applied by transferring learned knowledge such as dense community fitness functions from other domains. In conclusion, we build Super.Complex, RL complex detection, and DeepSLICEM - three accurate, efficient, scalable, and generalizable community detection algorithms, that effectively utilize known community information with different machine learning methods and also present 3 evaluation measures for accurate community evaluation.

Description

LCSH Subject Headings

Citation