Scalable Fine-Grained Parallel Cycle Enumeration Algorithms
This paper investigates scalable parallelisation of state-of-the-art cycle enumeration algorithms by Johnson and Read-Tarjan along with their applications to temporal graphs. We provide a comprehensive theoretical analysis of various parallel versions of these algorithms and evaluate their performance on multi-core processors. We show that a straightforward coarse-grained parallelisation approach is not scalable and suffers from load imbalance issues. To eliminate the load imbalance, we modify the Johnson and the Read-Tarjan algorithms to exploit finer-grained parallelism. We show that our fine-grained parallel Read-Tarjan algorithm is theoretically work efficient – i.e., it does no more work than its serial version. However, our fine-grained parallel Johnson algorithm does not share this property. Yet, in practice, our fine-grained parallel Johnson algorithm outperforms our fine-grained parallel Read-Tarjan algorithm. In any case, both of our contributed fine-grained parallel algorithms are scalable, meaning that they can effectively utilise an increasing number of software threads, which we prove theoretically and demonstrate through extensive experiments. On a cluster of multi-core CPUs with 256 physical cores that can execute 1024 simultaneous threads, our fine-grained parallel Johnson and Read-Tarjan algorithms are respectively up to 435× and 470× faster than their single-threaded versions. On the same compute cluster, our fine-grained parallel algorithms are on average an order of magnitude faster than their coarse-grained parallel counterparts.
READ FULL TEXT