Monday, April 28
 

7:30am EDT

Continental Breakfast
Monday April 28, 2025 7:30am - 8:55am EDT
Liberty Ballroom Foyer

7:30am EDT

Badge Pickup
Monday April 28, 2025 7:30am - 5:00pm EDT
Liberty Ballroom Foyer

8:55am EDT

Opening Remarks and Awards
Monday April 28, 2025 8:55am - 9:10am EDT
Program Co-Chairs: Theophilus A. Benson, Carnegie Mellon University; Radhika Niranjan Mysore, VMware Research Group
Liberty Ballroom

9:10am EDT

PRED: Performance-oriented Random Early Detection for Consistently Stable Performance in Datacenters
Monday April 28, 2025 9:10am - 9:30am EDT
Xinle Du, Huawei Technologies; Tong Li, Renmin University of China; Guangmeng Zhou, Zhuotao Liu, Hanlin Huang, and Xiangyu Gao, Tsinghua University; Mowei Wang and Kun Tan, Huawei Technologies; Ke Xu, Tsinghua University


For decades, Random Early Detection (RED) has been integrated into datacenter switches as a fundamental Active Queue Management (AQM) scheme. Accurate configuration of RED parameters is crucial to achieving high throughput and low latency. However, due to the highly dynamic nature of workloads in datacenter networks, maintaining consistently high performance with statically configured RED thresholds poses a challenge. Prior art applies reinforcement learning to predict proper thresholds, but its real-world deployment has been hindered by poor tail performance caused by instability. In this paper, we propose PRED, a novel system that enables automatic and stable RED parameter adjustment in response to traffic dynamics. PRED uses two loosely coupled systems, Flow Concurrent Stabilizer (FCS) and Queue Length Adjuster (QLA), to overcome the challenges of dynamically setting RED parameters to adapt to the ever-changing traffic pattern. We perform extensive evaluations on our physical testbed and large-scale simulations. The results demonstrate that PRED can keep up with the real-time network dynamics generated by realistic workloads. For instance, compared with static-threshold-based methods, PRED maintains a 66% lower switch queue length and obtains up to 80% lower Flow Completion Time (FCT). Compared with the state-of-the-art learning-based method, PRED reduces the tail FCT by 34%.
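As background for the thresholds PRED tunes at runtime, a minimal sketch of the classic RED drop-probability curve (function and parameter names are illustrative, not from the paper):

```python
def red_drop_probability(avg_q: float, min_th: float, max_th: float,
                         max_p: float = 0.1) -> float:
    """Classic RED: drop probability rises linearly between two thresholds.

    Below min_th nothing is dropped; above max_th everything is dropped;
    in between, probability grows linearly up to max_p. PRED's contribution
    is adjusting min_th/max_th dynamically rather than fixing them.
    """
    if avg_q < min_th:
        return 0.0          # queue short enough: never drop
    if avg_q >= max_th:
        return 1.0          # queue too long: always drop
    return max_p * (avg_q - min_th) / (max_th - min_th)
```

With static thresholds of 10 and 30 packets, an average queue of 20 gives a drop probability of 0.05; the paper's point is that no single (min_th, max_th) pair performs well across dynamic workloads.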


https://www.usenix.org/conference/nsdi25/presentation/du
Liberty Ballroom

9:10am EDT

Enabling Silent Telemetry Data Transmission with InvisiFlow
Monday April 28, 2025 9:10am - 9:30am EDT
Yinda Zhang, University of Pennsylvania; Liangcheng Yu, University of Pennsylvania and Microsoft Research; Gianni Antichi, Politecnico di Milano and Queen Mary University of London; Ran Ben Basat, University College London; Vincent Liu, University of Pennsylvania


Network applications from traffic engineering to path tracing often rely on the ability to transmit fine-grained telemetry data from network devices to a set of collectors. Unfortunately, prior work has observed—and we validate—that existing transmission methods for such data can result in significant overhead to user traffic and/or loss of telemetry data, particularly when the network is heavily loaded.

In this paper, we introduce InvisiFlow, a novel communication substrate to collect network telemetry data, silently. In contrast to previous systems that always push telemetry packets to collectors based on the shortest path, InvisiFlow dynamically seeks out spare network capacity by leveraging opportunistic sending and congestion gradients, thus minimizing both the loss rate of telemetry data and overheads on user traffic. In a FatTree topology, InvisiFlow can achieve near-zero loss rate even under high-load scenarios (around 33.8× lower loss compared to the state-of-the-art transmission methods used by systems like Everflow and Planck).


https://www.usenix.org/conference/nsdi25/presentation/zhang-yinda
Independence Ballroom

9:30am EDT

Rajomon: Decentralized and Coordinated Overload Control for Latency-Sensitive Microservices
Monday April 28, 2025 9:30am - 9:50am EDT
Jiali Xing, Akis Giannoukos, Paul Loh, Shuyue Wang, and Justin Qiu, University of Pennsylvania; Henri Maxime Demoulin, DBOS, Inc; Konstantinos Kallas, University of California, Los Angeles; Benjamin C. Lee, University of Pennsylvania


Microservices are increasingly central for cloud applications due to their flexibility and support for rapid integration and deployment. However, applications often experience overload or sudden traffic surges that exceed service capacity, resulting in increased latency or service failures. Moreover, microservices are decentralized, interdependent, and multiplexed, exacerbating risks from overload.


We present RAJOMON, a market-based overload control system for large microservice graphs. RAJOMON controls overload through distributed rate-limiting and load shedding. Clients attach tokens to requests and services charge a price for each API, dropping requests with insufficient tokens. Tokens and prices propagate through the entire call graph, piggybacking on requests and responses. Thus, RAJOMON is the first decentralized, end-to-end overload control system.


We implement and evaluate RAJOMON on a setup of up to 140 cores and on a variety of applications from academia and industry. Experiments indicate RAJOMON protects microservice goodput and tail latency from substantial demand spikes, even in the case of mixed request types and deeper service graphs. For high-load scenarios, RAJOMON reduces tail latency by 78% and increases goodput by 45% when compared against state-of-the-art overload control for microservices.
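The token-and-price mechanism described above can be pictured with a toy admission check (the function and its semantics are a simplified assumption for illustration, not RAJOMON's actual API):

```python
from typing import Optional

def admit(tokens: int, price: int) -> Optional[int]:
    """Market-based admission: a request carrying too few tokens for the
    service's current price is shed; otherwise the price is charged and
    the remaining budget travels downstream with the request."""
    if tokens < price:
        return None          # insufficient tokens: drop (load shedding)
    return tokens - price    # remainder piggybacks to the next service
```

A request with 10 tokens passing a service priced at 4 continues with 6 tokens; one with 3 tokens is dropped. In RAJOMON, prices rise under overload, so low-budget requests are shed early in the call graph.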


https://www.usenix.org/conference/nsdi25/presentation/xing
Liberty Ballroom

9:30am EDT

Unlocking ECMP Programmability for Precise Traffic Control
Monday April 28, 2025 9:30am - 9:50am EDT
Yadong Liu, Tencent; Yunming Xiao, University of Michigan; Xuan Zhang, Weizhen Dang, Huihui Liu, Xiang Li, and Zekun He, Tencent; Jilong Wang, Tsinghua University; Aleksandar Kuzmanovic, Northwestern University; Ang Chen, University of Michigan; Congcong Miao, Tencent


ECMP (equal-cost multi-path) has become a fundamental mechanism in data centers, distributing flows along multiple equivalent paths based on their hash values. Randomized distribution optimizes for the aggregate case, spreading load across flows over time. However, there exists a class of important Precise Traffic Control (PTC) tasks that are at odds with ECMP randomness. For instance, if an end host perceives that its flows are traversing a problematic switch/link, it often needs to change their paths before a fix can be rolled out. With randomized hashing, existing solutions resort to modifying flow tuples; since hashing mechanisms are undocumented and vary across switches and vendors, it may take many trials before yielding a new path. Many other similar cases exist where a precise and timely response is critical to the network.


We propose programmable ECMP (P-ECMP), a programming model, compiler, and runtime that provides precise traffic control. P-ECMP leverages an oft-ignored feature, ECMP groups, which allows for a constrained set of capabilities that are nonetheless sufficiently expressive for our tasks. An operator supplies high-level descriptions of their topology and policies, and our compiler generates PTC configurations for each switch. End hosts can reconfigure specific flows to use different PTC policies precisely and quickly, addressing a range of important use cases. We have evaluated P-ECMP using simulation at scale, and deployed one use case to a real-world data center that serves live user traffic.
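To make the randomness concrete, here is a toy model of hash-based ECMP path selection (CRC32 stands in for a switch's proprietary hash; real devices differ, which is exactly the opacity the paper works around):

```python
import zlib

def ecmp_next_hop(flow_tuple: tuple, paths: list) -> str:
    """Deterministically map a flow's 5-tuple onto one equal-cost path.

    The same flow always hashes to the same path; rewriting a tuple field
    may or may not land on a different path, which is why tuple-modification
    approaches need repeated trials to escape a bad link.
    """
    key = "|".join(map(str, flow_tuple)).encode()
    return paths[zlib.crc32(key) % len(paths)]
```

P-ECMP instead exposes the ECMP-group indirection itself, so an end host can steer a flow to a chosen group without guessing at the hash.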


https://www.usenix.org/conference/nsdi25/presentation/liu-yadong
Independence Ballroom

9:50am EDT

Learnings from Deploying Network QoS Alignment to Application Priorities for Storage Services
Monday April 28, 2025 9:50am - 10:10am EDT
Matthew Buckley and Parsa Pazhooheshy, Google and University of Toronto; Z. Morley Mao, Nandita Dukkipati, Hamid Hajabdolali Bazzaz, Priyaranjan Jha, Yingjie Bi, and Steve Middlekauff, Google; Yashar Ganjali, University of Toronto


To ensure that application network traffic is prioritized correctly within data center networks, it is critical to align the QoS configuration carried in packets with the intended priority of the application. These QoS configurations, typically encoded in the DSCP bits of the IP header, are interpreted by network switches and routers to determine resources, such as buffer space and scheduling priority, for network traffic. Conceptually, mapping application priorities within data center networks to network QoS configurations appears fairly straightforward, as long as the mapping is well defined. In this work, we describe our experience of aligning network QoS settings for intra-cluster storage traffic with application priorities on a per-RPC basis for a large data center network, with well-defined static mappings from priorities to QoS traffic classes. We describe some unexpected insights learned from the deployment, e.g., downgrading traffic to a lower QoS does not always imply worse network latency, due to over-used QoS bands in the network. We also share some challenges encountered on the way to a fleet-wide deployment, including concerns about potential performance regressions due to QoS downgrades. These lessons provide guidance on using a QoS-based scheduling strategy to meet service guarantees and can be applied to networks of any scale.
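A static priority-to-QoS mapping of the kind the paper deploys can be pictured as a small lookup table (the DSCP codepoints and class names below are common conventions used for illustration, not Google's actual mapping):

```python
# Hypothetical per-RPC priority -> DSCP codepoint mapping (illustrative values).
PRIORITY_TO_DSCP = {
    "latency_critical": 46,  # EF (expedited forwarding)
    "default": 0,            # best effort
    "batch": 8,              # CS1, a low-priority "scavenger" class
}

def dscp_for(priority: str) -> int:
    """Resolve an RPC priority to the DSCP value stamped into the IP header;
    unknown priorities fall back to best effort."""
    return PRIORITY_TO_DSCP.get(priority, 0)
```

The paper's finding is that even with such a well-defined table, the observed latency ordering of the classes need not match the table when some QoS bands are oversubscribed.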


https://www.usenix.org/conference/nsdi25/presentation/buckley
Liberty Ballroom

9:50am EDT

Enabling Portable and High-Performance SmartNIC Programs with Alkali
Monday April 28, 2025 9:50am - 10:10am EDT
Jiaxin Lin, UT Austin; Zhiyuan Guo, UCSD; Mihir Shah, NVIDIA; Tao Ji, Microsoft; Yiying Zhang, UCSD; Daehyeok Kim and Aditya Akella, UT Austin


Trends indicate that emerging SmartNICs, either from different vendors or generations from the same vendor, exhibit substantial differences in hardware parallelism and memory interconnects. These variations make porting programs across NICs highly complex and time-consuming, requiring programmers to significantly refactor code for performance based on each target NIC’s hardware characteristics.


We argue that an ideal SmartNIC compilation framework should allow developers to write target-independent programs, with the compiler automatically managing cross-NIC porting and performance optimization. We present such a framework, Alkali, that achieves this by (1) proposing a new intermediate representation for building flexible compiler infrastructure for multiple NIC targets and (2) developing a new iterative parallelism optimization algorithm that automatically ports and parallelizes the input programs based on the target NIC’s hardware characteristics.


Experiments across a wide range of NIC applications demonstrate that Alkali enables developers to easily write portable, high-performance NIC programs. Our compiler optimization passes can automatically port these programs and make them run efficiently across all targets, achieving performance within 9.8% of hand-tuned expert implementations.


https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin
Independence Ballroom

10:10am EDT

DISC: Backpressure Mitigation In Multi-tier Applications With Distributed Shared Connection
Monday April 28, 2025 10:10am - 10:30am EDT
Brice Ekane and Djob Mvondo, Univ. Rennes, Inria, CNRS, IRISA, France; Renaud Lachaize, Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; Yérom-David Bromberg, Univ. Rennes, Inria, CNRS, IRISA, France; Alain Tchana, Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG, 38000 Grenoble, France; Daniel Hagimont, IRIT, Université de Toulouse, CNRS, Toulouse INP, UT3 Toulouse, France


Most data-center applications are based on a multi-tier architecture, involving either coarse-grained software components (e.g., traditional 3-tier web applications) or fine-grained ones (e.g., microservices). Such applications are prone to the backpressure problem, which introduces strong performance coupling between tiers, degrading scalability and resource consumption. The problem arises because, on the response path towards the initial client, a significant fraction of the payloads in the messages exchanged between tiers correspond to "final" data that are simply relayed (i.e., without further modification) from a backend tier such as a database. This traffic puts additional pressure on the intermediate and frontend tiers.


To address this problem, we introduce DISC, a system allowing several tiers within a multi-tier chain to jointly act as endpoints of the same TCP connection. This enables the selective bypass of one or several tiers on the response path. Unlike existing solutions, DISC is (1) flexible — it accommodates arbitrary multi-tier topologies and heterogeneous application-level protocols, (2) fine-grained — it allows multiple tiers to be involved in the generation and emission of a given response message (e.g., to decouple the network path of the response headers and footers from the path of the response body), (3) and non-intrusive — it requires only minor and localized/modular modifications to the code base of legacy applications and is transparent for external clients. Evaluation results with several micro- and macro-benchmarks show that DISC can reduce the cumulative CPU load on servers by up to 41.5%, decrease the average and tail latencies respectively by up to 74.1% and 5.71×, and also improve the request rate by up to 45%.


https://www.usenix.org/conference/nsdi25/presentation/ekane
Liberty Ballroom

10:10am EDT

Scaling IP Lookup to Large Databases using the CRAM Lens
Monday April 28, 2025 10:10am - 10:30am EDT
Robert Chang and Pradeep Dogga, University of California, Los Angeles; Andy Fingerhut, Cisco Systems; Victor Rios and George Varghese, University of California, Los Angeles


Wide-area scaling trends require new approaches to Internet Protocol (IP) lookup, enabled by modern networking chips such as Intel Tofino, AMD Pensando, and Nvidia BlueField, which provide substantial ternary content-addressable memory (TCAM) and static random-access memory (SRAM). However, designing and evaluating scalable algorithms for these chips is challenging due to hardware-level constraints. To address this, we introduce the CRAM (CAM+RAM) lens, a framework that combines a formal model for evaluating algorithms on modern network processors with a set of optimization idioms. We demonstrate the effectiveness of CRAM by designing and evaluating three new IP lookup schemes: RESAIL, BSIC, and MashUp. RESAIL enables Tofino-2 to scale to 2.25 million IPv4 prefixes—likely sufficient for the next decade—while a pure TCAM approach supports only 250k prefixes, just 27% of the current global IPv4 routing table. Similarly, BSIC scales to 390k IPv6 prefixes on Tofino-2, supporting 3.2 times as many prefixes as a pure TCAM implementation. In contrast, existing state-of-the-art algorithms, SAIL for IPv4 and Hi-BST for IPv6, scale to considerably smaller sizes on Tofino-2.
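For context, the operation all three schemes accelerate is longest-prefix match, shown here in its simplest linear-scan form (hardware schemes such as RESAIL replace this scan with TCAM/SRAM lookups; the helper below is illustrative only):

```python
import ipaddress
from typing import Optional

def longest_prefix_match(addr: str, table: dict) -> Optional[str]:
    """Return the next hop of the most specific prefix covering addr.

    `table` maps CIDR strings to next hops. This linear scan is O(n) in
    software; a TCAM performs the same most-specific match in one cycle,
    which is why TCAM capacity is the scaling bottleneck the paper targets.
    """
    ip = ipaddress.ip_address(addr)
    best_hop, best_len = None, -1
    for prefix, hop in table.items():
        net = ipaddress.ip_network(prefix)
        if ip in net and net.prefixlen > best_len:
            best_hop, best_len = hop, net.prefixlen
    return best_hop
```

For a table containing 10.0.0.0/8 and 10.1.0.0/16, the address 10.1.2.3 matches the /16, the more specific of the two.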


https://www.usenix.org/conference/nsdi25/presentation/chang
Independence Ballroom

10:30am EDT

Coffee and Tea Break
Monday April 28, 2025 10:30am - 11:00am EDT
Liberty Ballroom Foyer

11:00am EDT

Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing
Monday April 28, 2025 11:00am - 11:20am EDT
Zhenyuan Ruan, MIT CSAIL; Shihang Li, Brown University; Kaiyan Fan, MIT CSAIL; Seo Jin Park, University of Southern California; Marcos K. Aguilera, VMware Research by Broadcom; Adam Belay, MIT CSAIL; Malte Schwarzkopf, Brown University


Datacenters today waste CPU and memory, as the resources demanded by applications often fail to match the resources available on machines. This strands resources: once one resource type runs out on a machine, additional applications that could consume the remaining resources cannot be placed there. Unusable stranded resources reduce server utilization and waste money and energy.

Quicksand is a new framework and runtime system that unstrands resources by providing developers with familiar, high-level abstractions (e.g., data structures, batch computing). Internally Quicksand decomposes them into resource proclets, granular units that each primarily consume resources of one type. Inspired by recent granular programming models, Quicksand decouples consumption of resources as much as possible. It splits, merges, and migrates resource proclets in milliseconds, so it can use resources on any machine, even if available only briefly.

Evaluation of our prototype with four applications shows that Quicksand uses stranded resources effectively; that Quicksand reacts to changing resource availability and demand within milliseconds, increasing utilization; and that porting applications to Quicksand requires moderate effort.


https://www.usenix.org/conference/nsdi25/presentation/ruan
Liberty Ballroom

11:00am EDT

On Temporal Verification of Stateful P4 Programs
Monday April 28, 2025 11:00am - 11:20am EDT
Delong Zhang, Chong Ye, and Fei He, School of Software, BNRist, Tsinghua University, Beijing 100084, China; Key Laboratory for Information System Security, MoE, China


Stateful P4 programs offload network states from the control plane to the data plane, enabling unprecedented network programmability. However, existing P4 verifiers overapproximate the stateful nature of P4 programs and are inherently inadequate for verifying network functions that require stateful decision-making.

To overcome this limitation, this paper introduces an innovative approach to verify P4 programs while accounting for their stateful feature. We propose a specification language named P4LTL, tailored for describing temporal properties of stateful P4 programs at the packet processing level. Additionally, we introduce a novel concept called the Büchi transaction, representing the product of the P4 program and the P4LTL specification. The P4 program verification problem can be reduced to determining the existence of any fair and feasible trace within the Büchi transaction. To the best of our knowledge, our approach represents the first endeavor in temporal verification of stateful P4 programs at the packet processing level. We implemented a prototype tool called p4tv. Evaluation results demonstrate p4tv’s effectiveness and efficiency in temporal verification of stateful P4 programs.


https://www.usenix.org/conference/nsdi25/presentation/zhang-delong
Independence Ballroom

11:20am EDT

Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs
Monday April 28, 2025 11:20am - 11:40am EDT
Quanxi Li, Hong Huang, Ying Liu, and Yanwen Xia, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Jie Zhang, Peking University; Mosong Zhou, Huawei Cloud; Xiaobing Feng and Huimin Cui, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Quan Chen, Shanghai Jiao Tong University; Yizhou Shan, Huawei Cloud; Chenxi Wang, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences


Microsecond (µs)-scale I/O fabrics raise a tension between programming productivity and performance, especially in disaggregated memory systems. The multithreaded synchronous programming model is popular for developing memory-disaggregated applications due to its intuitive program logic. However, our key insight is that although thread switching can effectively mitigate µs-scale latency, it leads to poor data locality and non-trivial scheduling overhead, leaving significant opportunities to improve performance further. This paper proposes a memory-disaggregated framework, Beehive, which improves remote access throughput by exploiting the asynchrony within each thread. To improve programming usability, Beehive allows programmers to develop applications in the conventional multithreaded synchronous model and automatically transforms the code, via the Rust compiler, into asynchronous code based on pararoutines (a newly proposed computation and scheduling unit). Beehive outperforms the state-of-the-art memory-disaggregated frameworks, i.e., Fastswap, Hermit, and AIFM, by 4.26×, 3.05×, and 1.58× on average.


https://www.usenix.org/conference/nsdi25/presentation/li-quanxi
Liberty Ballroom

11:20am EDT

NDD: A Decision Diagram for Network Verification
Monday April 28, 2025 11:20am - 11:40am EDT
Zechun Li, Peng Zhang, and Yichi Zhang, Xi'an Jiaotong University; Hongkun Yang, Google


State-of-the-art network verifiers extensively use the Binary Decision Diagram (BDD) as the underlying data structure to represent network state and equivalence classes. Despite its wide usage, we find BDD is not ideal for network verification: verifiers need to handle the low-level computation of equivalence classes, and still face scalability issues when the network state has a large number of bits.


To this end, this paper introduces the Network Decision Diagram (NDD), a new decision diagram customized for network verification. In a nutshell, NDD wraps BDD with another layer of decision diagram, such that each NDD node represents a field of the network, and each edge is labeled with a BDD encoding the values of that field. We designed and implemented a library for NDD, which features native support for equivalence classes and higher efficiency in memory and computation. Using the NDD library, we re-implemented five BDD-based verifiers with minor modifications to their original code, and observed a 100× reduction in memory cost and a 100× speedup. This indicates that NDD provides a drop-in replacement for BDD in network verifiers.


https://www.usenix.org/conference/nsdi25/presentation/li-zechun
Independence Ballroom

11:40am EDT

Making Serverless Pay-For-Use a Reality with Leopard
Monday April 28, 2025 11:40am - 12:00pm EDT
Tingjia Cao, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Tyler Caraza-Harter, University of Wisconsin-Madison


Serverless computing has gained traction due to its event-driven architecture and “pay for use” (PFU) billing model. However, our analysis reveals that current billing practices do not align with true resource consumption. This paper challenges the prevailing SLIM (static, linear, interactive-only model) assumptions that underpin existing billing models, demonstrating that current billing does not realize PFU for realistic workloads. We introduce the Nearly Pay-for-Use (NPFU) billing model, which accommodates varying CPU and memory demands, spot cores, and preemptible memory. We also introduce Leopard, an NPFU-based serverless platform that integrates billing awareness into several major subsystems: CPU scheduler, OOM killer, admission controller, and cluster scheduler. Experimental results indicate that Leopard benefits both providers and users, increasing throughput by more than 2x and enabling cost reductions.
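The gap between SLIM-style billing and NPFU can be sketched numerically (the rates, sampling scheme, and function names are illustrative assumptions, not Leopard's actual billing formulas):

```python
def slim_bill(alloc_mem_gb: float, duration_s: float, rate: float) -> float:
    """SLIM-style billing: the static memory allocation is charged for the
    full invocation duration, regardless of what was actually consumed."""
    return alloc_mem_gb * duration_s * rate

def npfu_bill(samples, cpu_rate: float, mem_rate: float) -> float:
    """Nearly pay-for-use: each sampled interval is charged for the CPU
    cores and memory (GB) actually measured in that interval.

    `samples` is a sequence of (cpu_cores, mem_gb) measurements."""
    return sum(cpu * cpu_rate + mem * mem_rate for cpu, mem in samples)
```

A function that reserves memory it rarely touches pays for the full reservation under the first model, but only for measured use under the second.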


https://www.usenix.org/conference/nsdi25/presentation/cao
Liberty Ballroom

11:40am EDT

Smart Casual Verification of the Confidential Consortium Framework
Monday April 28, 2025 11:40am - 12:00pm EDT
Heidi Howard, Markus A. Kuppe, Edward Ashton, and Amaury Chamayou, Azure Research, Microsoft; Natacha Crooks, Azure Research, Microsoft and UC Berkeley


The Confidential Consortium Framework (CCF) is an open-source platform for developing trustworthy and reliable cloud applications. CCF powers Microsoft's Azure Confidential Ledger service and as such it is vital to build confidence in the correctness of CCF's design and implementation. This paper reports our experiences applying smart casual verification to validate the correctness of CCF's novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. We use the term smart casual verification to describe our hybrid approach, which combines the rigor of formal specification and model checking with the pragmatism of automated testing, in our case binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods approaches require substantial buy-in and are often one-off efforts by domain experts, we have integrated our smart casual verification approach into CCF's CI pipeline, allowing contributors to continuously validate CCF as it evolves. We describe the challenges we faced in applying smart casual verification to a complex existing codebase and how we overcame them to find six subtle bugs in the design and implementation before they could impact production.


https://www.usenix.org/conference/nsdi25/presentation/howard
Independence Ballroom

12:00pm EDT

GRANNY: Granular Management of Compute-Intensive Applications in the Cloud
Monday April 28, 2025 12:00pm - 12:20pm EDT
Carlos Segarra, Simon Shillaker, Guo Li, and Eleftheria Mappoura, Imperial College London; Rodrigo Bruno, INESC-ID, Instituto Superior Técnico, University of Lisbon; Lluís Vilanova and Peter Pietzuch, Imperial College London


Parallel applications are typically implemented using multi-threading (with shared memory, e.g., OpenMP) or multi-processing (with message passing, e.g., MPI). While it seems attractive to deploy such applications in cloud VMs, existing cloud schedulers fail to manage these applications efficiently: they cannot scale multi-threaded applications dynamically when more vCPUs in a VM become available, and they cause fragmentation over time because of the static allocation of multi-process applications to VMs.


We describe GRANNY, a new distributed runtime that enables the fine-granular management of multi-threaded/process applications in cloud environments. GRANNY supports the vertical scaling of multi-threaded applications within a VM and the horizontal migration of multi-process applications between VMs. GRANNY achieves both through a single WebAssembly-based execution abstraction: Granules can execute application code with thread or process semantics and allow for efficient snapshotting. GRANNY scales up applications by adding more Granules at runtime, and de-fragments applications by migrating Granules between VMs. In both cases, it launches new Granules from snapshots efficiently. We evaluate GRANNY with dynamic scheduling policies and show that, compared to current schedulers, it reduces the makespan for OpenMP workloads by up to 60% and the fragmentation for MPI workloads by up to 25%.


https://www.usenix.org/conference/nsdi25/presentation/segarra
Liberty Ballroom

12:00pm EDT

VEP: A Two-stage Verification Toolchain for Full eBPF Programmability
Monday April 28, 2025 12:00pm - 12:20pm EDT
Xiwei Wu, Yueyang Feng, Tianyi Huang, Xiaoyang Lu, Shengkai Lin, Lihan Xie, Shizhen Zhao, and Qinxiang Cao, Shanghai Jiao Tong University


The extended Berkeley Packet Filter (eBPF) is a revolutionary technology that can safely and efficiently extend kernel capabilities. It has been widely used in networking, tracing, security, and more. However, existing eBPF verifiers impose strict constraints, often requiring repeated modifications to eBPF programs to pass verification. To enhance programmability, we introduce VEP, an annotation-guided eBPF program verification toolchain. VEP consists of three components: VEP-C, a verifier for annotated eBPF-C programs; VEP-compiler, a compiler targeting annotated eBPF bytecode; and VEP-eBPF, a lightweight bytecode-level proof checker. VEP allows users to verify the correctness of their programs with appropriate annotations, thus enabling full programmability. Our experimental results demonstrate that VEP addresses the limitations of existing verifiers, i.e., the Linux verifier and PREVAIL, and provides a more flexible and automated approach to kernel security.


https://www.usenix.org/conference/nsdi25/presentation/wu-xiwei
Independence Ballroom

12:20pm EDT

Symposium Luncheon and Test of Time Award Presentation
Monday April 28, 2025 12:20pm - 2:00pm EDT
Franklin Hall A

2:00pm EDT

MeshTest: End-to-End Testing for Service Mesh Traffic Management
Monday April 28, 2025 2:00pm - 2:20pm EDT
Naiqian Zheng, Tianshuo Qiao, Xuanzhe Liu, and Xin Jin, Peking University


We present MeshTest, the first end-to-end testing framework for service mesh traffic management. The key idea of MeshTest is to automatically generate input configurations with end-to-end semantics, and then create real test request suites for each input. There are two technical challenges. First, the input space of service mesh configurations is large and complex, and the input configurations must be carefully orchestrated to form end-to-end service flow paths. Second, the abstract output network behavior cannot be directly checked for correctness, so we need to generate a set of real requests capable of checking the possible behaviors. To address these challenges, we model the service flows of traffic management in service meshes and propose a novel Service Flow Exploration technique to enumerate all possible configuration resources and the interactions between them in the input configuration. We design and implement MeshTest, which contains an automatic input configuration generator based on Service Flow Exploration and a Service Mesh Oracle that leverages formal methods to generate test request suites. MeshTest has found 23 new bugs (19 confirmed and 10 fixed) in two popular service mesh systems, Istio and Linkerd.


https://www.usenix.org/conference/nsdi25/presentation/zheng-naiqian
Liberty Ballroom

2:00pm EDT

Pyrrha: Congestion-Root-Based Flow Control to Eliminate Head-of-Line Blocking in Datacenter
Monday April 28, 2025 2:00pm - 2:20pm EDT
Kexin Liu, Zhaochen Zhang, Chang Liu, and Yizhi Wang, Nanjing University; Vamsi Addanki and Stefan Schmid, TU Berlin; Qingyue Wang, Wei Chen, Xiaoliang Wang, and Jiaqi Zheng, Nanjing University; Wenhao Sun, Tao Wu, Ke Meng, Fei Chen, Weiguang Wang, and Bingyang Liu, Huawei, China; Wanchun Dou, Guihai Chen, and Chen Tian, Nanjing University


In modern datacenters, the effectiveness of end-to-end congestion control (CC) is quickly diminishing with the rapid evolution of bandwidth. Per-hop flow control (FC) can react to congestion more promptly. However, coarse-grained FC can result in Head-Of-Line (HOL) blocking, while fine-grained, per-flow FC eliminates the HOL blocking caused by flow control but does not scale well. This paper presents Pyrrha, a scalable flow control approach that provably eliminates HOL blocking while using a minimum number of queues. In Pyrrha, flow control first takes effect on the root of the congestion, i.e., the port where congestion occurs; flows are then controlled according to the congestion roots they contribute to. A prototype of Pyrrha is implemented on Tofino2 switches. Compared with state-of-the-art approaches, the average FCT of uncongested flows is reduced by 42%-98%, and 99th-percentile tail latency can be 1.6×-215× lower, without compromising the performance of congested flows.


https://www.usenix.org/conference/nsdi25/presentation/liu-kexin
Monday April 28, 2025 2:00pm - 2:20pm EDT
Independence Ballroom

2:20pm EDT

Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage
Monday April 28, 2025 2:20pm - 2:40pm EDT
Hamid Hajabdolali Bazzaz, Yingjie Bi, and Weiwu Pang, Google; Minlan Yu, Harvard University; Ramesh Govindan, University of Southern California; Neal Cardwell, Nandita Dukkipati, Meng-Jung Tsai, Chris DeForeest, and Yuxue Jin, Google; Charles Carver, Columbia University; Jan Kopański, Liqun Cheng, and Amin Vahdat, Google


Datacenter network hotspots, defined as links with persistently high utilization, can lead to performance bottlenecks. In this work, we study hotspots in Google’s datacenter networks. We find that these hotspots occur most frequently at ToR switches and can persist for hours. They are caused mainly by bandwidth demand-supply imbalance, largely due to high demand from network-intensive services, or demand exceeding available bandwidth when compute/storage upgrades outpace ToR bandwidth upgrades. Compounding this issue is bandwidth-independent task/data placement by datacenter compute and storage schedulers. We quantify the performance impact of hotspots, and find that they can degrade the end-to-end latency of some distributed applications by over 2× relative to low utilization levels. Finally, we describe simple improvements we deployed. In our cluster scheduler, adding hotspot-aware task placement reduced the number of hot ToRs by 90%; in our distributed file system, adding hotspot-aware data placement reduced p95 network latency by more than 50%. While congestion control, load balancing, and traffic engineering can efficiently utilize paths for a fixed placement, we find hotspot-aware placement – placing tasks and data under ToRs with higher available bandwidth – is crucial for achieving consistently good performance.
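The placement policy can be illustrated with a minimal greedy sketch. The `place_tasks` helper and the simple headroom accounting below are our own assumptions for illustration, not Google's actual scheduler:

```python
# Toy sketch of hotspot-aware task placement: greedily place each task
# under the ToR with the most available bandwidth (headroom), updating
# headroom as demand is assigned.

def place_tasks(tasks, tor_capacity_gbps, tor_used_gbps):
    """Assign each (task, demand) pair to the ToR with max available bandwidth."""
    placement = {}
    available = {t: tor_capacity_gbps[t] - tor_used_gbps[t]
                 for t in tor_capacity_gbps}
    for task, demand_gbps in tasks:
        # Pick the ToR with the highest remaining headroom for this task.
        best_tor = max(available, key=available.get)
        placement[task] = best_tor
        available[best_tor] -= demand_gbps  # account for the new demand
    return placement

tasks = [("job-a", 10), ("job-b", 10), ("job-c", 5)]
cap = {"tor1": 40, "tor2": 40}
used = {"tor1": 30, "tor2": 25}
print(place_tasks(tasks, cap, used))
```

Note how the greedy policy spreads demand away from the nearly-hot `tor1` instead of placing tasks bandwidth-independently.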


https://www.usenix.org/conference/nsdi25/presentation/bazzaz
Monday April 28, 2025 2:20pm - 2:40pm EDT
Liberty Ballroom

2:20pm EDT

eTran: Extensible Kernel Transport with eBPF
Monday April 28, 2025 2:20pm - 2:40pm EDT
Zhongjie Chen, Tsinghua University; Qingkai Meng, Nanjing University; ChonLam Lao, Harvard University; Yifan Liu and Fengyuan Ren, Tsinghua University; Minlan Yu, Harvard University; Yang Zhou, UC Berkeley and UC Davis


Evolving datacenters with diverse application demands are driving network transport designs. However, few such designs have successfully landed in the widely used kernel networking stack to benefit broader users, and those that do take multiple years. We present eTran, a system that makes kernel transport extensible so that diverse transport designs can be implemented and customized agilely. To achieve this, eTran leverages and extends eBPF-based techniques to customize the kernel to support complex transport functionalities safely. Meanwhile, eTran carefully absorbs user-space transport techniques for performance gains without sacrificing robust protection. We implement TCP (with DCTCP congestion control) and Homa under eTran, and achieve up to 4.8×/1.8× higher throughput with 3.7×/7.5× lower latency compared to existing kernel implementations.


https://www.usenix.org/conference/nsdi25/presentation/chen-zhongjie
Monday April 28, 2025 2:20pm - 2:40pm EDT
Independence Ballroom

2:40pm EDT

Enhancing Network Failure Mitigation with Performance-Aware Ranking
Monday April 28, 2025 2:40pm - 3:00pm EDT
Pooria Namyar and Arvin Ghavidel, University of Southern California; Daniel Crankshaw, Daniel S. Berger, Kevin Hsieh, and Srikanth Kandula, Microsoft; Ramesh Govindan, University of Southern California; Behnaz Arzani, Microsoft


Cloud providers install mitigations to reduce the impact of network failures within their datacenters. Existing network mitigation systems rely on simple local criteria or global proxy metrics to determine the best action. In this paper, we show that we can support a broader range of actions and select more effective mitigations by directly optimizing end-to-end flow-level metrics and analyzing actions holistically. To achieve this, we develop novel techniques to quickly estimate the impact of different mitigations and rank them with high fidelity. Our results on incidents from a large cloud provider show orders of magnitude improvements in flow completion time and throughput. We also show our approach scales to large datacenters.
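A toy version of the core idea, scoring each candidate mitigation by its estimated end-to-end impact rather than a local rule of thumb, can be sketched as follows. The drain-time capacity model and the candidate actions are illustrative assumptions, not the production system's estimator:

```python
# Rank candidate mitigations by a crude flow-level impact proxy: the time
# to drain the offered load through the capacity that remains after the
# mitigation is applied. Lower predicted drain time ranks higher.

def estimate_avg_fct(total_demand_gb, capacity_gbps):
    """Crude FCT proxy: time to push the offered load through remaining capacity."""
    if capacity_gbps <= 0:
        return float("inf")
    return total_demand_gb / capacity_gbps

def rank_mitigations(demand_gb, base_capacity_gbps, mitigations):
    """mitigations: {action: capacity delta in Gbps}. Rank by the FCT proxy."""
    scored = {action: estimate_avg_fct(demand_gb, base_capacity_gbps + delta)
              for action, delta in mitigations.items()}
    return sorted(scored, key=scored.get)

actions = {
    "disable_faulty_link": -10,   # lose the flaky link's 10 Gbps entirely
    "deprioritize_link": -2,      # keep it, shift most traffic elsewhere
    "reboot_switch": -40,         # large capacity hit while the switch drains
}
print(rank_mitigations(demand_gb=500, base_capacity_gbps=100, mitigations=actions))
```

Even this toy ranking shows why holistic impact estimation can prefer a gentler action over the locally obvious "disable the faulty link".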


https://www.usenix.org/conference/nsdi25/presentation/namyar
Monday April 28, 2025 2:40pm - 3:00pm EDT
Liberty Ballroom

2:40pm EDT

White-Boxing RDMA with Packet-Granular Software Control
Monday April 28, 2025 2:40pm - 3:00pm EDT
Chenxingyu Zhao and Jaehong Min, University of Washington; Ming Liu, University of Wisconsin-Madison; Arvind Krishnamurthy, University of Washington


Driven by diverse workloads and deployments, numerous innovations emerge to customize RDMA transport, spanning congestion control, multi-tenant isolation, routing, and more. However, RDMA's hardware-offloading nature poses significant rigidity when landing these innovations. Prior workflows to deliver customizations have either waited for lengthy hardware iterations, developed bespoke hardware, or applied coarse-grained control over the black-box RDMA NIC. Despite considerable efforts, current customization workflows still lack flexibility, raw performance, and broad availability.

In this work, we advocate for White-Boxing RDMA, which provides control of the hardware transport to general-purpose software while preserving raw data path performance. To facilitate the white-boxing methodology, we design and implement Software-Controlled RDMA (SCR), a framework enabling packet-granular software control over the hardware transport. To address challenges stemming from granular control over high-speed line rates, SCR employs effective control models, boosts the efficiency of subsystems within the framework, and leverages emerging hardware capabilities. We implement SCR on the latest Nvidia BlueField-3 equipped with Datapath Accelerators, delivering a spectrum of new customizations not present in legacy RDMA transport, such as Multi-Tenant Fair Scheduler, User-Defined Congestion Control, Receiver-Driven Flow Control, and Multi-Path Routing Selection. Furthermore, we demonstrate SCR's applicability for GPU-Direct and NVMe-oF RDMA with zero modifications to machine learning or storage code.


https://www.usenix.org/conference/nsdi25/presentation/zhao-chenxingyu
Monday April 28, 2025 2:40pm - 3:00pm EDT
Independence Ballroom

3:00pm EDT

One-Size-Fits-None: Understanding and Enhancing Slow-Fault Tolerance in Modern Distributed Systems
Monday April 28, 2025 3:00pm - 3:20pm EDT
Ruiming Lu, University of Michigan and Shanghai Jiao Tong University; Yunchi Lu and Yuxuan Jiang, University of Michigan; Guangtao Xue, Shanghai Jiao Tong University; Peng Huang, University of Michigan


Recent studies have shown that various hardware components exhibit fail-slow behavior at scale. However, the characteristics of distributed software's tolerance of such slow faults remain ill-understood. This paper presents a comprehensive study that investigates the characteristics and current practices of slow-fault tolerance in modern distributed software. We focus on the fundamentally nuanced nature of slow faults. We develop a testing pipeline to systematically introduce diverse slow faults, measure their impact under different workloads, and identify the patterns. Our study shows that even small changes can lead to dramatically different reactions. While some systems have added slow-fault handling mechanisms, they are mostly controlled by static thresholds, which can hardly accommodate the highly sensitive and dynamic characteristics. To address this gap, we design ADR, a lightweight library to use within system code and make fail-slow handling adaptive. Evaluation shows ADR significantly reduces the impact of slow faults.
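The contrast between static thresholds and adaptive handling can be sketched in a few lines. The EWMA-baseline policy below is our own assumption for illustration, not the ADR library's actual API:

```python
# Minimal sketch of adaptive fail-slow detection: instead of a static
# latency threshold, track a moving baseline per peer and flag a sample as
# slow when it drifts far above that baseline.

class AdaptiveSlowDetector:
    def __init__(self, alpha=0.1, factor=3.0):
        self.alpha = alpha        # EWMA smoothing weight
        self.factor = factor      # how far above baseline counts as "slow"
        self.baseline_ms = None   # learned baseline latency

    def observe(self, latency_ms):
        """Feed one latency sample; return True if it looks fail-slow."""
        if self.baseline_ms is None:
            self.baseline_ms = latency_ms
            return False
        slow = latency_ms > self.factor * self.baseline_ms
        if not slow:  # only fold healthy samples into the baseline
            self.baseline_ms = ((1 - self.alpha) * self.baseline_ms
                                + self.alpha * latency_ms)
        return slow

d = AdaptiveSlowDetector()
flags = [d.observe(s) for s in [10, 11, 9, 12, 80, 10]]
print(flags)
```

Because the threshold tracks the observed baseline, the same detector works unchanged for a peer whose normal latency is 10 ms or 100 ms, which is exactly what a static threshold cannot do.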


https://www.usenix.org/conference/nsdi25/presentation/lu
Monday April 28, 2025 3:00pm - 3:20pm EDT
Liberty Ballroom

3:00pm EDT

SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol
Monday April 28, 2025 3:00pm - 3:20pm EDT
Konstantinos Prasopoulos, EPFL; Ryan Kosta, UCSD; Edouard Bugnion, EPFL; Marios Kogias, Imperial College London


Datacenter congestion control protocols are challenged to navigate the throughput-buffering trade-off while relative packet buffer capacity is trending lower year-over-year. In this context, receiver-driven protocols — which schedule packet transmissions instead of reacting to congestion — excel when the bottleneck lies at the ToR-to-receiver link. However, when multiple receivers must use a shared link (e.g., ToR to Spine), their independent schedules can conflict.

We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled, while shared links should be managed with reactive control algorithms. The approach allows receivers to both precisely schedule their downlinks and to coordinate over shared bottlenecks. Critically, SIRD also treats sender uplinks as shared links, enabling the flow of congestion feedback from senders to receivers, which then adapt their scheduling to each sender’s real-time capacity. This results in tight scheduling, enabling high bandwidth utilization with little contention, and thus minimal latency-inducing buffering in the fabric.

We implement SIRD on top of the Caladan stack and show that SIRD’s asymmetric design can deliver 100Gbps in software while keeping network queuing minimal. We further compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.
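The sender-informed grant computation can be reduced to a one-line toy model. The function name, the per-round cap, and the numbers are illustrative assumptions, not SIRD's actual protocol logic:

```python
# Toy model of sender-informed, receiver-driven scheduling: a receiver
# paces grants to fill its own downlink, but caps each grant at what the
# sender reports it can actually send (its uplink is shared by several
# receivers). The sender-side term is the "sender-informed" part.

def next_grant(downlink_free_gbps, sender_reported_free_gbps, max_grant_gbps=25):
    """Grant the minimum of receive headroom, sender headroom, and a cap."""
    return min(downlink_free_gbps, sender_reported_free_gbps, max_grant_gbps)

# Receiver has 40 Gbps free, but the sender's shared uplink only has 10:
print(next_grant(40, 10))
```

Without the sender-reported term, the receiver would over-grant against a sender that cannot deliver, wasting downlink slots it could have scheduled for other senders.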


https://www.usenix.org/conference/nsdi25/presentation/prasopoulos
Monday April 28, 2025 3:00pm - 3:20pm EDT
Independence Ballroom

3:20pm EDT

Coffee and Tea Break
Monday April 28, 2025 3:20pm - 3:50pm EDT
Monday April 28, 2025 3:20pm - 3:50pm EDT
Liberty Ballroom Foyer

3:50pm EDT

Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation
Monday April 28, 2025 3:50pm - 4:10pm EDT
Fei Gui, Tsinghua University; BNRist; Tsinghua Shenzhen International Graduate School; Kaihui Gao and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Vincent Liu, University of Pennsylvania; Ran Zhang and Hongbing Yang, Zhongguancun Laboratory; Dian Xiong, Tsinghua University


The rapid expansion of large language models (LLMs) requires the development of extensive GPU clusters, with companies deploying clusters with tens to hundreds of thousands of GPUs. This growth significantly expands the design space for LLM training systems, requiring thorough exploration of different parallelization strategies, communication parameters, congestion control, fabric topology, etc. Current methods require up to 10k simulation experiments to identify optimal configurations, with inadequate exploration leading to significant degradation of training performance.

In this paper, we tackle the overlooked problem of efficiently conducting parallel simulation experiments for design space exploration. Our analysis and experiments show that Single-process Multi-experiment (SPME) achieves superior performance by reducing scheduling overhead and optimizing resource utilization, yet remains insufficient for current AI cluster scales. To enhance SPME’s efficacy, we introduce Multiverse, a novel GPU-based AI training simulator. Multiverse efficiently leverages the computing throughput of GPUs with optimizations such as pull-based synchronization, high-fidelity intra-server communication, and a kernel-fusion technique. Extensive experiments validate the accuracy and efficiency of Multiverse, demonstrating less than 3.0% discrepancy from real-world LLM training on clusters of up to 54,000 GPUs and achieving a 43.1–73.2× speedup over state-of-the-art CPU-based simulators in various use cases.


https://www.usenix.org/conference/nsdi25/presentation/gui
Monday April 28, 2025 3:50pm - 4:10pm EDT
Liberty Ballroom

3:50pm EDT

Mowgli: Passively Learned Rate Control for Real-Time Video
Monday April 28, 2025 3:50pm - 4:10pm EDT
Neil Agarwal and Rui Pan, Princeton University; Francis Y. Yan, University of Illinois Urbana-Champaign; Ravi Netravali, Princeton University


Rate control algorithms are at the heart of video conferencing platforms, determining target bitrates that match dynamic network characteristics for high quality. Despite the promise that recent data-driven strategies have shown for this challenging task, the performance degradation that they introduce during training has been a nonstarter for many production services, precluding adoption. This paper aims to bolster the practicality of data-driven rate control by presenting an alternate avenue for experiential learning: using purely existing telemetry logs that we surprisingly observe embed performant decisions but often at the wrong times or in the wrong order. To realize this approach despite the inherent uncertainty that log-based learning brings (i.e., lack of feedback for new decisions), our system, Mowgli, combines a variety of robust learning techniques (i.e., conservatively reasoning about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise). Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm, increasing average video bitrates by 15–39% while reducing freeze rates by 60–100%.


https://www.usenix.org/conference/nsdi25/presentation/agarwal
Monday April 28, 2025 3:50pm - 4:10pm EDT
Independence Ballroom

4:10pm EDT

Optimizing RLHF Training for Large Language Models with Stage Fusion
Monday April 28, 2025 4:10pm - 4:30pm EDT
Yinmin Zhong, Zili Zhang, Bingyang Wu, and Shengyu Liu, School of Computer Science, Peking University; Yukun Chen, Changyi Wan, Hanpeng Hu, Lei Xia, Ranchen Ming, and Yibo Zhu, StepFun; Xin Jin, School of Computer Science, Peking University


We present RLHFuse, an efficient training system with stage fusion for Reinforcement Learning from Human Feedback (RLHF). Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to overlap the execution of generation and inference stages, thus mitigating the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches and performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, effectively mitigating the pipeline bubbles. The experiments show that RLHFuse increases the training throughput by up to 3.7×, compared to existing systems.


https://www.usenix.org/conference/nsdi25/presentation/zhong
Monday April 28, 2025 4:10pm - 4:30pm EDT
Liberty Ballroom

4:10pm EDT

Dissecting and Streamlining the Interactive Loop of Mobile Cloud Gaming
Monday April 28, 2025 4:10pm - 4:30pm EDT
Yang Li, Jiaxing Qiu, Hongyi Wang, and Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Jing Yang, Tsinghua University; Hao Lin, Tsinghua University and University of Illinois Urbana-Champaign; Yunhao Liu, Tsinghua University; Bo Xiao and Xiaokang Qin, Ant Group; Tianyin Xu, University of Illinois Urbana-Champaign


With cloud-side computing and rendering, mobile cloud gaming (MCG) is expected to deliver high-quality gaming experiences to budget mobile devices. However, our measurements on representative MCG platforms reveal that even under good network conditions, all platforms exhibit high interactive latency of 112–403 ms from a user-input action to its display response, which critically affects users’ quality of experience. Moreover, jitters in network latency often lead to significant fluctuations in interactive latency.

In this work, we collaborate with a commercial MCG platform to conduct the first in-depth analysis on the interactive latency of cloud gaming. We identify VSync, the synchronization primitive of Android graphics pipeline, to be a key contributor to the excessive interactive latency; as many as five VSync events are intricately invoked, which serialize the complex graphics processing logic on both the client and cloud sides. To address this, we design an end-to-end VSync regulator, dubbed LoopTailor, which minimizes VSync events by decoupling game rendering from the lengthy cloud-side graphics pipeline and coordinating cloud game rendering directly with the client. We implement LoopTailor on the collaborated platform and commodity Android devices, reducing the interactive latency (by ∼34%) to stably below 100 ms.


https://www.usenix.org/conference/nsdi25/presentation/li-yang
Monday April 28, 2025 4:10pm - 4:30pm EDT
Independence Ballroom

4:30pm EDT

Minder: Faulty Machine Detection for Large-scale Distributed Model Training
Monday April 28, 2025 4:30pm - 4:50pm EDT
Yangtao Deng, Tsinghua University; Xiang Shi and Zhuo Jiang, ByteDance; Xingjian Zhang, Tsinghua University; Lei Zhang, Zhang Zhang, Bo Li, Zuquan Song, Hang Zhu, and Gaohong Liu, ByteDance; Fuliang Li, Northeastern University; Shuguang Wang, Haibin Lin, and Jianxi Ye, ByteDance; Minlan Yu, Harvard University


Large-scale distributed model training requires simultaneous training on up to thousands of machines. Detecting faulty machines is critical when an unexpected fault occurs on one of them. From our experience, a training task can encounter two faults per day on average, each possibly leading to a halt of hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring metric patterns of faulty machines, which can last for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks that each involve up to thousands of machines. In our real-world fault detection scenarios, Minder accurately and efficiently reacts to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
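One way to picture metric-pattern-based detection is peer comparison: in data-parallel training, healthy machines behave similarly, so the machine whose metrics deviate most from every peer is the prime suspect. The nearest-neighbor heuristic below is an assumption for illustration, not Minder's actual model:

```python
# Illustrative sketch of faulty-machine detection by peer comparison:
# score each machine by the distance of its metric vector to its nearest
# peer, and return the machine that is farthest from all others.

def most_anomalous(machine_metrics):
    """machine_metrics: {name: [metric samples]}. Return the machine whose
    metric vector is farthest from its nearest peer."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    scores = {}
    for name, vec in machine_metrics.items():
        peers = [v for n, v in machine_metrics.items() if n != name]
        scores[name] = min(dist(vec, p) for p in peers)
    return max(scores, key=scores.get)

metrics = {
    "host-0": [0.91, 0.90, 0.92],   # GPU utilization over 3 intervals
    "host-1": [0.90, 0.92, 0.91],
    "host-2": [0.89, 0.91, 0.90],
    "host-3": [0.31, 0.28, 0.30],   # straggler: utilization collapsed
}
print(most_anomalous(metrics))
```

The appeal of the peer-comparison framing is that it needs no absolute threshold: whatever "normal" looks like for this task, the outlier stands out relative to its peers.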


https://www.usenix.org/conference/nsdi25/presentation/deng
Monday April 28, 2025 4:30pm - 4:50pm EDT
Liberty Ballroom

4:30pm EDT

Region-based Content Enhancement for Efficient Video Analytics at the Edge
Monday April 28, 2025 4:30pm - 4:50pm EDT
Weijun Wang, Institute for AI Industry Research (AIR), Tsinghua University; Liang Mi, Shaowei Cen, and Haipeng Dai, State Key Laboratory for Novel Software Technology, Nanjing University; Yuanchun Li, Institute for AI Industry Research (AIR), Tsinghua University; Xiaoming Fu, University of Göttingen; Yunxin Liu, Institute for AI Industry Research (AIR), Tsinghua University


Video analytics is widespread in applications serving our society. Recent advances in content enhancement for video analytics offer significant bandwidth savings and accuracy improvements. However, existing content-enhanced video analytics systems are excessively computationally expensive and provide extremely low throughput. In this paper, we present region-based content enhancement, which enhances only the important regions in videos to improve analytical accuracy. Our system, RegenHance, enables high-accuracy and high-throughput video analytics at the edge by 1) a macroblock-based region importance predictor that identifies the important regions quickly and precisely, 2) a region-aware enhancer that stitches sparsely distributed regions into dense tensors and enhances them efficiently, and 3) a profile-based execution planner that allocates appropriate resources to the enhancement and analytics components. We prototype RegenHance on five heterogeneous edge devices. Experiments on two analytical tasks reveal that region-based enhancement improves overall accuracy by 10-19% and achieves 2-3× higher throughput compared to state-of-the-art frame-based enhancement methods.
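The select-then-stitch pipeline can be sketched in miniature. The importance scores, the 16x16 macroblock grid, and the helper names are illustrative assumptions, not RegenHance's predictor or enhancer:

```python
# Sketch of "enhance only important regions": score macroblocks, keep those
# above a threshold, and gather them into a dense batch for the enhancer
# instead of upscaling whole frames.

def important_regions(scores, threshold=0.5):
    """scores: {(row, col): importance in [0, 1]} per macroblock."""
    return [mb for mb, s in scores.items() if s >= threshold]

def stitch(regions, frame, mb=16):
    """Gather the selected mb x mb macroblocks into a dense list of crops."""
    crops = []
    for row, col in regions:
        crop = [r[col * mb:(col + 1) * mb]
                for r in frame[row * mb:(row + 1) * mb]]
        crops.append(crop)
    return crops

scores = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.7}
frame = [[pixel for pixel in range(32)] for _ in range(32)]  # dummy 32x32 frame
selected = important_regions(scores)
crops = stitch(selected, frame)
print(selected, len(crops))
```

In this toy frame only 2 of 4 macroblocks are enhanced, which is the source of the throughput gain: the expensive enhancement model sees a fraction of the pixels.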


https://www.usenix.org/conference/nsdi25/presentation/wang-weijun
Monday April 28, 2025 4:30pm - 4:50pm EDT
Independence Ballroom

4:50pm EDT

Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters
Monday April 28, 2025 4:50pm - 5:10pm EDT
Zhiyi Yao and Pengbo Hu, Fudan University and Tencent; Congcong Miao and Xuya Jia, Tencent; Zuning Liang and Yuedong Xu, Fudan University; Chunzhi He, Hao Lu, Mingzhuo Chen, Xiang Li, Zekun He, Yachen Wang, and Xianneng Zou, Tencent; Junchen Jiang, University of Chicago


Training Large Language Models (LLMs) on large-scale GPU clusters requires numerous iterations over several months. Existing works mainly focus on addressing failures that interrupt the iterative training process to improve the utilization of GPU clusters. However, our large-scale measurements over tens of thousands of GPUs show that the training process exhibits an unstable state, with some irregular iterations taking more than twice the time of a normal iteration. Surprisingly, we find that these irregular iterations greatly extend the time of LLM training, an impact even more severe than that of failures. Meanwhile, the irregular phenomenon is silent, making it challenging to localize accurately. In this paper, we propose a first-of-its-kind system called Holmes, leveraging communication operators to accurately localize these irregularities in real-time. The core of Holmes's approach is to employ an enhanced abnormal operator detection model and a novel communication operator graph to perform efficient irregularity localization. Furthermore, Holmes conducts cross-iteration analysis to improve localization accuracy. We evaluate Holmes using large-scale trace-driven simulations and a production-level prototype. Large-scale simulation results demonstrate that Holmes achieves irregularity localization accuracy of 97.21%. Production-level prototype evaluation results show Holmes can localize irregularity within 30.3 seconds, a 6.52× speedup compared to traditional approaches.


https://www.usenix.org/conference/nsdi25/presentation/yao
Monday April 28, 2025 4:50pm - 5:10pm EDT
Liberty Ballroom

4:50pm EDT

Tooth: Toward Optimal Balance of Video QoE and Redundancy Cost by Fine-Grained FEC in Cloud Gaming Streaming
Monday April 28, 2025 4:50pm - 5:10pm EDT
Congkai An, Huanhuan Zhang, Shibo Wang, Jingyang Kang, Anfu Zhou, Liang Liu, and Huadong Ma, Beijing University of Posts and Telecommunications; Zili Meng, Hong Kong University of Science and Technology; Delei Ma, Yusheng Dong, and Xiaogang Lei, Well-Link Times Inc.


Despite the rapid rise of cloud gaming, real-world evaluations of its quality of experience (QoE) remain scarce. To fill this gap, we conduct a large-scale measurement campaign, analyzing over 60,000 sessions on an operational cloud gaming platform. We find that current cloud gaming streaming suffers from substantial bandwidth wastage and severe interaction stalls simultaneously. In-depth investigation reveals the underlying reason: existing streaming adopts coarse-grained Forward Error Correction (FEC) encoding without considering the adverse impact of frame length variation, which results in over-protection of large frames (i.e., bandwidth waste) and under-protection of smaller ones (i.e., interaction stalls). To remedy the problem, we propose Tooth, a per-frame adaptive FEC that aims to achieve the optimal balance between satisfactory QoE and efficient bandwidth usage. To build Tooth, we design a dual-module FEC encoding strategy, which takes full consideration of both frame length variation and network dynamics, and hence determines an appropriate FEC redundancy rate for each frame. Moreover, we circumvent the formidable per-frame FEC computational overhead with a lightweight design, so as to meet the rigid latency bound of real-time cloud gaming. We implement, deploy, and evaluate Tooth in the operational cloud gaming system. Extensive field tests demonstrate that Tooth significantly outperforms existing state-of-the-art FEC methods, reducing stall rates by 40.2% to 85.2%, enhancing video bitrates by 11.4% to 29.2%, and lowering bandwidth costs by 54.9% to 75.0%.
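Why frame length matters for FEC can be shown with a back-of-the-envelope model. The binomial recovery model and the 1% residual-loss target below are our own illustrative assumptions, not Tooth's actual encoder:

```python
# Per-frame adaptive FEC sketch: pick the smallest parity count that keeps
# the frame-loss probability under a target, given the frame's packet count
# and the estimated network loss rate. Small frames need proportionally
# more redundancy than large ones, so a single fixed rate over- or
# under-protects.

from math import comb

def frame_loss_prob(data_pkts, parity_pkts, p_loss):
    """P(frame unrecoverable) = P(more than parity_pkts of the n packets lost)."""
    n = data_pkts + parity_pkts
    ok = sum(comb(n, k) * p_loss**k * (1 - p_loss)**(n - k)
             for k in range(parity_pkts + 1))
    return 1 - ok

def pick_parity(data_pkts, p_loss, target=0.01, max_parity=32):
    """Smallest parity count keeping frame-loss probability under target."""
    for parity in range(max_parity + 1):
        if frame_loss_prob(data_pkts, parity, p_loss) <= target:
            return parity
    return max_parity

# Redundancy ratio (parity/data) falls as frames grow:
for pkts in (2, 10, 50):
    parity = pick_parity(pkts, p_loss=0.05)
    print(pkts, parity, parity / pkts)
```

The declining parity-to-data ratio is the core observation: one coarse redundancy rate tuned for average frames wastes bandwidth on large frames while leaving small frames exposed.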


https://www.usenix.org/conference/nsdi25/presentation/an
Monday April 28, 2025 4:50pm - 5:10pm EDT
Independence Ballroom

5:10pm EDT

SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision
Monday April 28, 2025 5:10pm - 5:30pm EDT
Xizheng Wang, Alibaba Cloud and Tsinghua University; Qingxu Li, Yichi Xu, and Gang Lu, Alibaba Cloud; Dan Li, Tsinghua University; Li Chen, Zhongguancun Laboratory; Heyang Zhou, Alibaba Cloud; Linkang Zheng, Alibaba Cloud and South China University of Technology; Sen Zhang, Yikai Zhu, Yang Liu, Pengcheng Zhang, Kun Qian, Kunling He, Jiaqi Gao, and Ennan Zhai, Alibaba Cloud; Dennis Cai, Alibaba Group; Binzhang Fu, Alibaba Cloud


The large number of GPUs required for a single LLM training run significantly hinders the validation of new designs, tunings, and optimizations, calling for efficient simulators. Existing simulators, however, only target a specific granularity of the entire training, intrinsically leading to imprecision. This paper presents SimAI, a unified simulator aiming at precisely and efficiently simulating the LLM training procedure at scale. Through selective and high-fidelity integration of the training frameworks, the kernel computation, and the collective communication library into the simulating procedure, SimAI achieves high precision in simulations. SimAI further conducts multi-thread acceleration and implements lock-free global context-sharing to accelerate execution. The effectiveness of SimAI is validated by its performance results, which show an average of 98.1% alignment with real-world results under various test scenarios and affirm its robustness and adaptability from small-scale labs to large-scale industrial environments. SimAI delivers meaningful guidelines for new host designs and parameter settings, directly benefiting in-production LLM training. We also share experiences and lessons learned during the evolution of SimAI. SimAI is open sourced at https://github.com/aliyun/SimAI.


https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai
Monday April 28, 2025 5:10pm - 5:30pm EDT
Liberty Ballroom

5:10pm EDT

AsTree: An Audio Subscription Architecture Enabling Massive-Scale Multi-Party Conferencing
Monday April 28, 2025 5:10pm - 5:30pm EDT
Tong Meng, Wenfeng Li, Chao Yuan, Changqing Yan, and Le Zhang, ByteDance Inc.


While operating a multi-party video conferencing system (Lark) globally, we find that audio subscription alone can pose considerable challenges to the network, especially at massive scale. The traditional strategy of subscribing to all remote participants suffers from signaling storms and excessive bandwidth and resource consumption on both the server and client sides. Aimed at enhanced scalability, we share our design of AsTree, an audio subscription architecture. Through a cascading tree topology and media-plane-based audio selection, AsTree dramatically reduces the number of signaling messages and audio streams to forward. Practical deployment in Lark reduces audio and video stall ratios by more than 30% and 50%, respectively. We also receive 40% fewer negative client reviews, strongly demonstrating the value of AsTree.
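Media-plane audio selection can be sketched with a few lines: a forwarding node keeps only the loudest streams, so fan-out stays flat as rooms grow. The top-3 policy and the field names are illustrative assumptions, not AsTree's actual selection logic:

```python
# Toy media-plane audio selection: rather than every client subscribing to
# every speaker, a forwarding node keeps only the top-N loudest streams.
# With N fixed, per-client signaling and stream count no longer grow with
# room size.

import heapq

def select_streams(audio_levels, n=3):
    """audio_levels: {participant: loudness in dB}. Keep the n loudest."""
    return sorted(heapq.nlargest(n, audio_levels, key=audio_levels.get))

levels = {"alice": -20.0, "bob": -55.0, "carol": -18.0,
          "dave": -60.0, "erin": -25.0}
print(select_streams(levels))   # only 3 of the 5 streams are forwarded
```

In a 1,000-participant room the same policy still forwards only 3 streams per client, which is where the scalability comes from.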


https://www.usenix.org/conference/nsdi25/presentation/meng
Monday April 28, 2025 5:10pm - 5:30pm EDT
Independence Ballroom

5:30pm EDT

ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Monday April 28, 2025 5:30pm - 5:50pm EDT
Borui Wan, The University of Hong Kong; Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, and Xin Liu, ByteDance Inc.; Chuan Wu, The University of Hong Kong


Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production environments, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale throughout the lifecycle of LFM development. We introduce ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint features: a parallelism-agnostic checkpoint representation that enables efficient load-time checkpoint resharding; a generic checkpoint saving/loading workflow to accommodate multiple training frameworks and support different storage backends; full-stack optimizations to ensure high I/O efficiency and scalability; a suite of monitoring tools to streamline large-scale performance analysis and bottleneck detection. Compared to existing open-source checkpointing systems [51, 57], ByteCheckpoint significantly reduces runtime checkpoint stalls, achieving an average reduction of 54.20×. For saving and loading times, ByteCheckpoint achieves improvements of up to 9.96× and 8.80×, respectively.
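Load-time resharding can be illustrated with a minimal flat-tensor model: a checkpoint saved under one parallelism degree is re-cut into shards for another. This toy code is our own sketch of the concept, not ByteCheckpoint's representation or file format:

```python
# Minimal sketch of parallelism-agnostic resharding: treat each saved
# tensor as a logical flat array plus per-rank shards, so a checkpoint
# saved by W ranks can be re-cut for W' ranks at load time.

def merge_shards(shards):
    """Reassemble the logical flat tensor from ordered per-rank shards."""
    flat = []
    for shard in shards:
        flat.extend(shard)
    return flat

def reshard(shards, new_world_size):
    """Split the logical tensor evenly across new_world_size ranks."""
    flat = merge_shards(shards)
    per_rank = len(flat) // new_world_size
    return [flat[i * per_rank:(i + 1) * per_rank]
            for i in range(new_world_size)]

# Saved by 4 ranks, resumed on 2 ranks:
saved = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(reshard(saved, 2))
```

Because the logical layout is independent of how many ranks saved it, the same checkpoint serves resumption, evaluation, and stage transfer under different parallelism configurations.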


https://www.usenix.org/conference/nsdi25/presentation/wan-borui
Monday April 28, 2025 5:30pm - 5:50pm EDT
Liberty Ballroom

6:00pm EDT

ByteDance Sponsor Event
Monday April 28, 2025 6:00pm - 7:00pm EDT
Join ByteDance for a panel with networking experts and explore career development opportunities!
Monday April 28, 2025 6:00pm - 7:00pm EDT
Franklin Hall 5-6

6:00pm EDT

Mentoring
Monday April 28, 2025 6:00pm - 7:00pm EDT
The NSDI '25 mentorship program is designed to give students and recent graduates a chance to network with other attendees, get career advice from a senior member of the community, and obtain feedback on their research. If you are interested in participating in this activity, either as a mentor or mentee, see the NSDI '25 Mentorship Program page for details.
Monday April 28, 2025 6:00pm - 7:00pm EDT
Franklin Hall 7-8
 