Loading…
Monday April 28, 2025 3:50pm - 4:10pm EDT
Fei Gui, Tsinghua University; BNRist; Tsinghua Shenzhen International Graduate School; Kaihui Gao and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Vincent Liu, University of Pennsylvania; Ran Zhang and Hongbing Yang, Zhongguancun Laboratory; Dian Xiong, Tsinghua University


The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, with companies deploying clusters with tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of different parallelization strategies, communication parameters, congestion control, fabric topology, etc. Current methods require up to 10k simulation experiments to identify optimal configurations, with inadequate exploration leading to significant degradation of training performance.

In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME’s efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse leverages the computing throughput of GPUs efficiently with optimizations such as a pull-based synchronization, highfidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy with real-world LLM training on clusters of up to 54,000 GPUs, achieving 43.1−73.2X speedup over state-of-the-art CPU-based simulators in various use cases.


https://www.usenix.org/conference/nsdi25/presentation/gui
Monday April 28, 2025 3:50pm - 4:10pm EDT
Liberty Ballroom

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link