NSDI '25: 22nd USENIX Symposium on Networked Systems Design and Implementation: Full Schedule


8:00am EDT

Continental Breakfast

Tuesday April 29, 2025 8:00am - 9:00am EDT

Liberty Ballroom Foyer


8:00am EDT

Badge Pickup

Tuesday April 29, 2025 8:00am - 5:00pm EDT

Liberty Ballroom Foyer


9:00am EDT

AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training

Tuesday April 29, 2025 9:00am - 9:20am EDT

Liberty Ballroom

Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, and Zewen Jin, University of Science and Technology of China; Youshan Miao, Microsoft Research; Cheng Li, University of Science and Technology of China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Collective communication libraries are pivotal in optimizing the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations assume that these libraries are well tuned, ignoring their low-level parameter selection. In this paper, we present a novel automated tuning method, AutoCCL, that significantly improves communication performance without incurring additional costs. One of the primary challenges we tackle is the state explosion in searching for the optimal configuration. To overcome this, we decouple implementation-related parameters from those sensitive to the search space size and propose a divide-and-conquer algorithm, minimizing the requirement for exhaustive trials. We further propose an online tuning approach that accounts for communication-computation interference to enhance accuracy in finding optimal configurations, while hiding tuning overhead within early iterations of training jobs. We implement AutoCCL atop NCCL, a leading and widely used communication library provided by NVIDIA. Our evaluation on both a 2-node cluster (16 A40 GPUs, intra-node NVLink, inter-node 2× 400Gbps InfiniBand) and a 4-node cluster (32 A40 GPUs, intra-node PCIe, inter-node 100Gbps InfiniBand) demonstrates that AutoCCL achieves 1.24-1.29× and 1.15-1.22× speedups on microbenchmarks compared to NCCL and another SOTA NCCL tuner, respectively, and up to 1.80× and 1.49× with concurrent computation. End-to-end evaluations on three large language models and one vision model show 1.07-1.32× improvements in per-iteration training time.

https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin


Track 1

9:00am EDT

Pineapple: Unifying Multi-Paxos and Atomic Shared Registers

Tuesday April 29, 2025 9:00am - 9:20am EDT

