Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, and Zewen Jin, University of Science and Technology of China; Youshan Miao, Microsoft Research; Cheng Li, University of Science and Technology of China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Collective communication libraries are pivotal to the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations assume that these libraries are already well-tuned and ignore their low-level parameter selection. In this paper, we present AutoCCL, a novel automated tuning method that significantly improves communication performance without incurring additional costs. One of the primary challenges we tackle is the state explosion in searching for the optimal configuration. To overcome this, we decouple implementation-related parameters from those sensitive to the size of the search space and propose a divide-and-conquer algorithm that minimizes the need for exhaustive trials. We further propose an online tuning approach that accounts for communication-computation interference to improve the accuracy of finding optimal configurations, while hiding the tuning overhead within the early iterations of training jobs. We implement AutoCCL atop NCCL, a leading and widely used communication library provided by NVIDIA. Our evaluation on both a 2-node cluster (16 A40 GPUs, intra-node NVLink, inter-node 2× 400 Gbps InfiniBand) and a 4-node cluster (32 A40 GPUs, intra-node PCIe, inter-node 100 Gbps InfiniBand) demonstrates that AutoCCL achieves 1.24-1.29× and 1.15-1.22× speedups on microbenchmarks over NCCL and another state-of-the-art NCCL tuner, respectively, and up to 1.80× and 1.49× speedups with concurrent computation. End-to-end evaluations on three large language models and one vision model show 1.07-1.32× improvements in per-iteration training time.
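To make the search-space reduction concrete, the sketch below illustrates, purely as an assumption-laden example and not AutoCCL's actual algorithm, how splitting tunable parameters into independently searched groups replaces a full cross-product search. The parameter names (NCCL_ALGO, NCCL_PROTO, NCCL_NTHREADS, NCCL_BUFFSIZE) are real NCCL environment variables used only for illustration; `run_benchmark`, the grouping, and the candidate values are hypothetical stand-ins.

```python
"""Minimal divide-and-conquer tuning sketch (illustrative only)."""
import itertools
import random
from typing import Dict, List


def run_benchmark(config: Dict[str, str]) -> float:
    """Hypothetical measurement hook: return the latency (lower is better)
    of a timed collective (e.g., all-reduce) under the given configuration."""
    return random.random()  # placeholder for a real timed NCCL collective


# Candidate values, partitioned into groups that are tuned independently
# instead of over their full cross product.
PARAM_GROUPS: List[Dict[str, List[str]]] = [
    {"NCCL_ALGO": ["Ring", "Tree"], "NCCL_PROTO": ["LL", "LL128", "Simple"]},
    {"NCCL_NTHREADS": ["64", "128", "256", "512"]},
    {"NCCL_BUFFSIZE": [str(1 << s) for s in range(20, 25)]},  # 1 MiB .. 16 MiB
]


def tune() -> Dict[str, str]:
    best: Dict[str, str] = {}
    for group in PARAM_GROUPS:
        names = list(group)
        best_cost, best_choice = float("inf"), dict()
        # Exhaust only this group's cross product; parameters from other
        # groups stay at their current best (or library defaults).
        for values in itertools.product(*(group[n] for n in names)):
            candidate = {**best, **dict(zip(names, values))}
            cost = run_benchmark(candidate)
            if cost < best_cost:
                best_cost, best_choice = cost, dict(zip(names, values))
        best.update(best_choice)
    return best


if __name__ == "__main__":
    print(tune())
```

With these example values, the grouped search needs 6 + 4 + 5 = 15 trials, whereas the full cross product would require 2 × 3 × 4 × 5 = 120; this is the kind of trial-count reduction the divide-and-conquer formulation targets.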