DiSLR: Distributed Sampling with Limited Redundancy For Triangle Counting in Graph Streams

02/12/2018
by   Kijung Shin, et al.
0

Given a web-scale graph that grows over time, how should its edges be stored and processed on multiple machines for rapid and accurate estimation of the count of triangles? The count of triangles (i.e., cliques of size three) has proven useful in many applications, including anomaly detection, community detection, and link recommendation. For triangle counting in large and dynamic graphs, recent work has focused largely on streaming algorithms and distributed algorithms. To achieve the advantages of both approaches, we propose DiSLR, a distributed streaming algorithm that estimates the counts of global triangles and local triangles associated with each node. Making one pass over the input stream, DiSLR carefully processes and stores the edges across multiple machines so that the redundant use of computational and storage resources is minimized. Compared to its best competitors, DiSLR is (a) Accurate: giving up to 39X smaller estimation error, (b) Fast: up to 10.4X faster, scaling linearly with the number of edges in the input stream, and (c) Theoretically sound: yielding unbiased estimates with variances decreasing faster as the number of machines is scaled up.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset