Parallel community detection on large graphs with MapReduce and GraphChi

作者:

Highlights:

摘要

Community detection from social network data gains much attention from academia and industry since it has many real-world applications. The Girvan–Newman (GN) algorithm is a divisive hierarchical clustering algorithm for community detection, which is regarded as one of the most popular algorithms. It exploits the concept of edge betweenness to divide a network into multiple communities. Though it is being widely used, it has limitations in supporting large-scale networks since it needs to calculate the shortest path between every pair of vertices in a network. In this paper, we develop two parallel versions of the GN algorithm to support large-scale networks. First, we propose a new algorithm, which we call Shortest Path Betweenness MapReduce Algorithm (SPB-MRA), that utilizes the MapReduce model. Second, we propose another new algorithm, which we call Shortest Path Betweenness Vertex-Centric Algorithm (SPB-VCA), that utilizes the vertex-centric model. An approximation technique is also developed to further speed up community detection processes. We implemented SPB-MRA using Hadoop and SPB-VCA using GraphChi, and then evaluated the performance of SPB-MRA on Amazon EC2 instances and that of SPB-VCA on a single commodity PC. The evaluation results showed that the elapsed time of SPB-MRA decreased almost linearly as the number of reducers increased, SPB-VCA outperformed SPB-MRA just on a single PC by 4–6 times, and the approximation technique introduced negligible errors.

论文关键词:Clustering, classification, and association rules,Mining methods and algorithms,Community detection,Social networks,MapReduce,Vertex-centric model

论文评审过程:Available online 8 May 2015, Version of Record 19 July 2016.

论文官网地址:https://doi.org/10.1016/j.datak.2015.05.001