Loading…
Monday April 28, 2025 5:10pm - 5:30pm EDT
Xizheng Wang, Alibaba Cloud and Tsinghua University; Qingxu Li, Yichi Xu, and Gang Lu, Alibaba Cloud; Dan Li, Tsinghua University; Li Chen, Zhongguancun Laboratory; Heyang Zhou, Alibaba Cloud; Linkang Zheng, Alibaba Cloud and South China University of Technology; Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, and Ennan Zhai, Alibaba Cloud; Dennis Cai, Alibaba Group; Binzhang Fu, Alibaba Cloud


The large number of GPUs required for a single LLM training significantly hinders the validation of new designs, tunings, and optimizations, calling for the occurrence of efficient simulators. Existing simulators, however, only target a specific granularity of the entire training, intrinsically leading to imprecision. This paper presents SimAI, a unified simulator aiming at precisely and efficiently simulating the LLM training procedure at scale. Through selective and high-fidelity integration of the training frameworks, the kernel computation, and the collective communication library into the simulating procedure, SimAI achieves high precision in simulations. SimAI further conducts multi-thread acceleration and implements lock-free global context-sharing to accelerate the execution speed. The effectiveness of SimAI is validated by its performance results, which show an average of 98.1% alignment to real-world results under various test scenarios and affirm its robustness and adaptability from small-scale labs to large-scale industrial environments. SimAI delivers meaningful guidelines for new host designs and parameter settings, directly benefiting in-production LLM training. We also share experiences and lessons learned during the evolution of SimAI. SimAI is open sourced at https://github.com/aliyun/SimAI.


https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai
Monday April 28, 2025 5:10pm - 5:30pm EDT
Liberty Ballroom

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link