Xizheng Wang, Alibaba Cloud and Tsinghua University; Qingxu Li, Yichi Xu, and Gang Lu, Alibaba Cloud; Dan Li, Tsinghua University; Li Chen, Zhongguancun Laboratory; Heyang Zhou, Alibaba Cloud; Linkang Zheng, Alibaba Cloud and South China University of Technology; Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, and Ennan Zhai, Alibaba Cloud; Dennis Cai, Alibaba Group; Binzhang Fu, Alibaba Cloud

The large number of GPUs required for a single LLM training run significantly hinders the validation of new designs, tunings, and optimizations, calling for efficient simulators. Existing simulators, however, each target only one specific granularity of the training process, which intrinsically leads to imprecision. This paper presents SimAI, a unified simulator that aims to simulate the LLM training procedure at scale precisely and efficiently. Through selective and high-fidelity integration of the training frameworks, the kernel computation, and the collective communication library into the simulation procedure, SimAI achieves high precision in simulations. SimAI further uses multi-threading and lock-free global context sharing to speed up execution. SimAI's effectiveness is validated by its performance results, which show an average of 98.1% alignment with real-world results across various test scenarios and affirm its robustness and adaptability from small-scale labs to large-scale industrial environments. SimAI delivers meaningful guidelines for new host designs and parameter settings, directly benefiting in-production LLM training. We also share experiences and lessons learned during the evolution of SimAI. SimAI is open sourced at https://github.com/aliyun/SimAI.
https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai