Network inference with statistical guarantees
Networks arise in a huge variety of real data scenarios, from social networks like Facebook and user-product networks in recommendation systems to protein-protein interaction networks in biology. In this thesis, we focus on developing fast and provable algorithms for several network inference problems.
In the first part, we are interested in the overlapping community detection problem under the popular Mixed Membership Stochastic Blockmodel (MMSB). We first establish sufficient conditions under which the symmetric non-negative matrix factorization optimization has a unique solution under MMSB, and propose a computationally efficient algorithm called GeoNMF that is provably optimal, and hence consistent, over a broad parameter regime. Then, using the inherent geometry of MMSB, we link the inference of overlapping communities to the problem of finding the corners of a noisy rotated and scaled simplex, for which consistent algorithms exist. We use this as a building block in our algorithm to infer the community memberships of each node, and provide uniform rates of convergence for the inferred community membership vector of every node in the network. As a byproduct of our analysis, we derive sharp row-wise eigenvector deviation bounds, and provide a cleaning step that drastically improves performance on sparse networks. Our results hold over a broad parameter regime where the average degree grows only poly-logarithmically with the number of nodes. In experiments on simulated and real datasets, we show that our method achieves lower error with less variability than competing methods, and processes real-world networks of up to 100,000 nodes within tens of seconds.
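The corner-finding step can be illustrated with a small successive-projection-style sketch (an illustration of the idea, not the thesis's exact procedure): given a matrix whose rows lie in a simplex, greedily pick the row farthest from the span of the corners found so far.

```python
import numpy as np

def find_corners(X, k):
    """Greedy corner hunting on the rows of X (successive projection):
    repeatedly pick the row with the largest residual norm, then
    project all rows onto the orthogonal complement of that corner."""
    R = X.astype(float).copy()
    corners = []
    for _ in range(k):
        j = int(np.argmax(np.linalg.norm(R, axis=1)))
        corners.append(j)
        u = R[j] / np.linalg.norm(R[j])
        R = R - np.outer(R @ u, u)  # remove the chosen corner's direction
    return corners

# Toy example: rows 0-2 are the true corners of a simplex in R^3;
# the remaining rows are interior convex combinations of the corners.
V = np.eye(3)
W = np.array([[0.5, 0.3, 0.2], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
X = np.vstack([V, W @ V])
print(sorted(find_corners(X, 3)))  # -> [0, 1, 2], the three pure rows
```

In the noisy setting, the rows would come from the leading eigenvectors of the adjacency matrix, and the memberships of the remaining nodes can then be read off by expressing each row in the basis of the recovered corners.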
In the second part, we go beyond MMSB for overlapping community detection. Many existing overlapping clustering methods model each person (or word, or book) as a non-negative weighted combination of "exemplars" who belong solely to one community, plus some small noise. Geometrically, each person is a point on a cone whose corners are these exemplars. This basic form encompasses the widely used MMSB for networks and its degree-corrected variants, as well as topic models such as LDA. We show that a simple one-class SVM yields provably consistent parameter inference for all such models, and scales to large datasets. Experiments on several simulated and real datasets show that our algorithm, called SVM-cone, is both accurate and scalable.
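To see intuitively why a one-class SVM can locate cone corners, consider this minimal sketch (it assumes scikit-learn's `OneClassSVM` and is an illustration of the geometric idea, not the thesis's full SVM-cone procedure): after normalizing each point to unit norm, the data lie on a spherical cap, and the support vectors of a linear one-class SVM are the points closest to the separating hyperplane, i.e., the angular extremes of the cone.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Points on a 2-D cone between the two extreme rays ("corners") at
# angles 0 and 90 degrees; every point is normalized to unit norm.
angles = np.linspace(0.0, np.pi / 2, 20)
X = np.column_stack([np.cos(angles), np.sin(angles)])

# A linear one-class SVM separates the data from the origin; its
# support vectors are the points nearest the hyperplane, which here
# are the angular extremes of the cone.
clf = OneClassSVM(kernel="linear", nu=0.2).fit(X)
print(sorted(clf.support_))  # the corner indices 0 and 19 appear here
```

In the full method, the rows come from the spectral embedding of the network (or the document-word matrix), and the support vectors are clustered to produce one representative per corner.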
The final contribution of this thesis is a pair of novel nonparametric methods for network covariate estimation. Networks with node covariates are commonplace: for example, people in a social network have interests, product preferences, and so on. If we know the covariates for some nodes, can we infer them for the remaining nodes? We provide two provably consistent methods for this problem. For "low-rank" latent variable models, we develop SVD-RBF, which uses the top principal components of the network in a nonparametric regression. For general models, we present CN-VEC, which constructs a similarity measure between two nodes based on the patterns of their 2-hop neighborhoods, and then predicts node covariates by averaging the covariates of the top-k most similar nodes under this measure. SVD-RBF is consistent for low-rank models when the average degree grows as Õ(log n), while CN-VEC is consistent for a wide range of models when the degree grows as Õ(n^{1/3}). To our knowledge, CN-VEC is the first provably consistent method for this problem under general models. Both methods are fast, and CN-VEC is also parameter-free. Experiments on 4 simulated network models and 3 real-world datasets show the effectiveness of our algorithms compared to the state of the art.
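The CN-VEC idea can be sketched as follows (the function and its details are illustrative assumptions, not the thesis's implementation): compare nodes by their 2-hop connectivity patterns, here taken as rows of A@A, and fill in a missing covariate by averaging over the k most similar nodes whose covariates are known.

```python
import numpy as np

def predict_covariates(A, y, known, k=5):
    """Illustrative sketch of the CN-VEC idea: score node similarity by
    comparing 2-hop neighborhood patterns (rows of A @ A), then predict
    a missing covariate as the average over the k most similar nodes
    with known covariates."""
    A2 = A @ A                      # counts of 2-hop paths between node pairs
    known = list(known)
    preds = {}
    for u in range(A.shape[0]):
        if u in known:
            continue
        # distance between the 2-hop profiles of u and each known node
        d = np.linalg.norm(A2[u] - A2[known], axis=1)
        top = np.argsort(d)[:k]
        preds[u] = float(np.mean([y[known[i]] for i in top]))
    return preds

# Toy example: two disconnected 4-cliques; the covariate is 1 in the
# first clique and 0 in the second, known only at nodes 0, 1, 4, 5.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
y = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
preds = predict_covariates(A, y, known=[0, 1, 4, 5], k=2)
print(preds[2], preds[6])  # -> 1.0 0.0
```

Since each unknown node's 2-hop profile matches the known nodes of its own clique, the averages recover the correct cluster-level covariate; the real method works from the same principle on noisy networks.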