Name: OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Start: 2025-04-29T09:20:00-0400
End: 2025-04-29T09:40:00-0400

Tuesday April 29, 2025 9:20am - 9:40am EDT

Liberty Ballroom

Ertza Warraich, Purdue University; Omer Shabtai and Khalid Manaa, Nvidia; Shay Vargaftik, VMware Research; Yonatan Piasetzky and Matty Kadosh, Nvidia; Lalith Suresh, Feldera; Muhammad Shahbaz, University of Michigan

We present OptiReduce, a new collective-communication system for the cloud that provides bounded, predictable completion times for deep-learning jobs despite variability in computation (stragglers) and communication (congestion and gradient drops). OptiReduce exploits the inherent resiliency and stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients, providing an efficient balance between (tail) performance and the resulting accuracy of the trained models.

Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve DDL jobs’ tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that, in shared cloud environments (e.g., CloudLab), OptiReduce achieves on average 70% and 30% faster time-to-accuracy (TTA) than Gloo and NCCL, respectively.
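
The abstract names a Hadamard Transform as one way to soften the effect of gradient drops but does not describe how it is applied. As a rough illustration only, the sketch below assumes a randomized, orthonormal Walsh-Hadamard transform applied before sending and inverted after receiving, so that a lost slot costs a little of every coordinate rather than all of one. The function names (`fwht`, `encode`, `decode`, `drop`, `max_error`) and the zero-fill handling of lost slots are hypothetical and are not taken from OptiReduce.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def encode(gradient, signs):
    """Flip signs at random, then rotate with the Hadamard transform."""
    return fwht([g * s for g, s in zip(gradient, signs)])


def decode(received, signs):
    """Invert encode(): the orthonormal transform is its own inverse."""
    return [v * s for v, s in zip(fwht(received), signs)]


def drop(values, lost):
    """Assumption: a lost slot is simply replaced by zero at the receiver."""
    return [0.0 if i in lost else v for i, v in enumerate(values)]


def max_error(reference, estimate):
    """Largest per-coordinate absolute error."""
    return max(abs(r - e) for r, e in zip(reference, estimate))


if __name__ == "__main__":
    rng = random.Random(0)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
    gradient = [8.0] + [0.0] * 7           # one dominant coordinate
    lost = {0}                             # the slot that never arrives

    direct = drop(gradient, lost)
    spread = decode(drop(encode(gradient, signs), lost), signs)
    print("max error, sent as-is:     ", max_error(gradient, direct))
    print("max error, Hadamard-spread:", max_error(gradient, spread))
```

In this toy case, losing slot 0 of the raw vector wipes out the dominant coordinate entirely, whereas losing slot 0 of the transformed vector leaves a small error on each coordinate after decoding. How the actual system combines this with Transpose AllReduce and the bounded transport is described in the paper, not here.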

https://www.usenix.org/conference/nsdi25/presentation/warraich

Track 1

NSDI '25: 22nd USENIX Symposium on Networked Systems Design and Implementation
