TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Training

04/11/2023
by William Won, et al.

Collective communications are an indispensable part of distributed training. Running a topology-aware collective algorithm is crucial for optimizing communication performance because it minimizes congestion. Today, such algorithms exist only for a small set of simple topologies, which limits the topologies that training clusters can employ and makes it difficult to handle irregular topologies caused by network failures. In this paper, we propose TACOS, an automated topology-aware collective synthesizer for arbitrary input network topologies. TACOS synthesized an All-Reduce algorithm 3.73x faster than baselines, and synthesized collective algorithms for a 512-NPU system in just 6.1 minutes.
