Monday April 28, 2025 3:00pm - 3:20pm EDT
Ruiming Lu, University of Michigan and Shanghai Jiao Tong University; Yunchi Lu and Yuxuan Jiang, University of Michigan; Guangtao Xue, Shanghai Jiao Tong University; Peng Huang, University of Michigan


Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.


https://www.usenix.org/conference/nsdi25/presentation/lu
Liberty Ballroom
