Loading…
Tuesday April 29, 2025 11:30am - 11:50am EDT
Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, and Dennis Cai, Alibaba Cloud


Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios due to the differences in computing paradigms between traditional cloud computing and model training. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training service. We share our experience in the motivation, design, and evolution of Aegis. Keeping easy-to-deploy as the primary principle, Aegis Phase- 1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 cogitatively chose to customize the collective communication library for sophisticated failure localization in runtime without modifying customer code. Besides the failure localization, we further equipped Aegis with the capabilities on handling performance degradation and failure checking before delivery. Aegis has been deployed in our production training cloud service for one year. Aegis decreases more than 97% of the idle time wasted by diagnosis, 84% of the training task restart count and 71% of the performance degradation.


https://www.usenix.org/conference/nsdi25/presentation/dong
Tuesday April 29, 2025 11:30am - 11:50am EDT
Liberty Ballroom

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link