NSDI '25: 22nd USENIX Symposium on Networked Systems Design and Implementation: Full Schedule

9:10am EDT

Enabling Silent Telemetry Data Transmission with InvisiFlow

Monday April 28, 2025 9:10am - 9:30am EDT

Yinda Zhang, University of Pennsylvania; Liangcheng Yu, University of Pennsylvania and Microsoft Research; Gianni Antichi, Politecnico di Milano and Queen Mary University of London; Ran Ben Basat, University College London; Vincent Liu, University of Pennsylvania

Network applications from traffic engineering to path tracing often rely on the ability to transmit fine-grained telemetry data from network devices to a set of collectors. Unfortunately, prior work has observed—and we validate—that existing transmission methods for such data can result in significant overhead to user traffic and/or loss of telemetry data, particularly when the network is heavily loaded.

In this paper, we introduce InvisiFlow, a novel communication substrate to collect network telemetry data, silently. In contrast to previous systems that always push telemetry packets to collectors based on the shortest path, InvisiFlow dynamically seeks out spare network capacity by leveraging opportunistic sending and congestion gradients, thus minimizing both the loss rate of telemetry data and overheads on user traffic. In a FatTree topology, InvisiFlow can achieve near-zero loss rate even under high-load scenarios (around 33.8× lower loss compared to the state-of-the-art transmission methods used by systems like Everflow and Planck).

https://www.usenix.org/conference/nsdi25/presentation/zhang-yinda

Monday April 28, 2025 9:10am - 9:30am EDT
Independence Ballroom

Track 2

9:30am EDT

Unlocking ECMP Programmability for Precise Traffic Control

Monday April 28, 2025 9:30am - 9:50am EDT

Independence Ballroom

Yadong Liu, Tencent; Yunming Xiao, University of Michigan; Xuan Zhang, Weizhen Dang, Huihui Liu, Xiang Li, and Zekun He, Tencent; Jilong Wang, Tsinghua University; Aleksandar Kuzmanovic, Northwestern University; Ang Chen, University of Michigan; Congcong Miao, Tencent

ECMP (equal-cost multi-path) has become a fundamental mechanism in data centers, which distributes flows along multiple equivalent paths based on their hash values. Randomized distribution optimizes for the aggregate case, spreading load across flows over time. However, there exists a class of important Precise Traffic Control (PTC) tasks that are at odds with ECMP randomness. For instance, if an end host perceives that its flows are traversing a problematic switch/link, it often needs to change their paths before a fix can be rolled out. With randomized hashing, existing solutions resort to modifying flow tuples; since hashing mechanisms are unknown and they vary across switches/vendors, it may take many trials before yielding a new path. Many other similar cases exist where precise and timely response is critical to the network.

We propose programmable ECMP (P-ECMP), a programming model, compiler, and runtime that provides precise traffic control. P-ECMP leverages an oft-ignored feature, ECMP groups, which allows for a constrained set of capabilities that are nonetheless sufficiently expressive for our tasks. An operator supplies high-level descriptions of their topology and policies, and our compiler generates PTC configurations for each switch. End hosts can reconfigure specific flows to use different PTC policies precisely and quickly, addressing a range of important use cases. We have evaluated P-ECMP using simulation at scale, and deployed one use case to a real-world data center that serves live user traffic.

https://www.usenix.org/conference/nsdi25/presentation/liu-yadong

Monday April 28, 2025 9:30am - 9:50am EDT
Independence Ballroom

Track 2

9:50am EDT

Enabling Portable and High-Performance SmartNIC Programs with Alkali

Monday April 28, 2025 9:50am - 10:10am EDT

Independence Ballroom

Jiaxin Lin, UT Austin; Zhiyuan Guo, UCSD; Mihir Shah, NVIDIA; Tao Ji, Microsoft; Yiying Zhang, UCSD; Daehyeok Kim and Aditya Akella, UT Austin

Trends indicate that emerging SmartNICs, either from different vendors or generations from the same vendor, exhibit substantial differences in hardware parallelism and memory interconnects. These variations make porting programs across NICs highly complex and time-consuming, requiring programmers to significantly refactor code for performance based on each target NIC’s hardware characteristics.

We argue that an ideal SmartNIC compilation framework should allow developers to write target-independent programs, with the compiler automatically managing cross-NIC porting and performance optimization. We present such a framework, Alkali, that achieves this by (1) proposing a new intermediate representation for building flexible compiler infrastructure for multiple NIC targets and (2) developing a new iterative parallelism optimization algorithm that automatically ports and parallelizes the input programs based on the target NIC’s hardware characteristics.

Experiments across a wide range of NIC applications demonstrate that Alkali enables developers to easily write portable, high-performance NIC programs. Our compiler optimization passes can automatically port these programs and make them run efficiently across all targets, achieving performance within 9.8% of hand-tuned expert implementations.

https://www.usenix.org/conference/nsdi25/presentation/lin-jiaxin

Monday April 28, 2025 9:50am - 10:10am EDT
Independence Ballroom

Track 2

10:10am EDT

Scaling IP Lookup to Large Databases using the CRAM Lens

Monday April 28, 2025 10:10am - 10:30am EDT

Independence Ballroom

Robert Chang and Pradeep Dogga, University of California, Los Angeles; Andy Fingerhut, Cisco Systems; Victor Rios and George Varghese, University of California, Los Angeles

Wide-area scaling trends require new approaches to Internet Protocol (IP) lookup, enabled by modern networking chips such as Intel Tofino, AMD Pensando, and Nvidia BlueField, which provide substantial ternary content-addressable memory (TCAM) and static random-access memory (SRAM). However, designing and evaluating scalable algorithms for these chips is challenging due to hardware-level constraints. To address this, we introduce the CRAM (CAM+RAM) lens, a framework that combines a formal model for evaluating algorithms on modern network processors with a set of optimization idioms. We demonstrate the effectiveness of CRAM by designing and evaluating three new IP lookup schemes: RESAIL, BSIC, and MashUp. RESAIL enables Tofino-2 to scale to 2.25 million IPv4 prefixes—likely sufficient for the next decade—while a pure TCAM approach supports only 250k prefixes, just 27% of the current global IPv4 routing table. Similarly, BSIC scales to 390k IPv6 prefixes on Tofino-2, supporting 3.2 times as many prefixes as a pure TCAM implementation. In contrast, existing state-of-the-art algorithms, SAIL for IPv4 and Hi-BST for IPv6, scale to considerably smaller sizes on Tofino-2.

https://www.usenix.org/conference/nsdi25/presentation/chang

Monday April 28, 2025 10:10am - 10:30am EDT
Independence Ballroom

Track 2

11:00am EDT

On Temporal Verification of Stateful P4 Programs

Monday April 28, 2025 11:00am - 11:20am EDT

Independence Ballroom

Delong Zhang, Chong Ye, and Fei He, School of Software, BNRist, Tsinghua University, Beijing 100084, China; Key Laboratory for Information System Security, MoE, China

Stateful P4 programs offload network states from the control plane to the data plane, enabling unprecedented network programmability. However, existing P4 verifiers overapproximate the stateful nature of P4 programs and are inherently inadequate for verifying network functions that require stateful decision-making.

To overcome this limitation, this paper introduces an innovative approach to verify P4 programs while accounting for their stateful feature. We propose a specification language named P4LTL, tailored for describing temporal properties of stateful P4 programs at the packet processing level. Additionally, we introduce a novel concept called the Büchi transaction, representing the product of the P4 program and the P4LTL specification. The P4 program verification problem can be reduced to determining the existence of any fair and feasible trace within the Büchi transaction. To the best of our knowledge, our approach represents the first endeavor in temporal verification of stateful P4 programs at the packet processing level. We implemented a prototype tool called p4tv. Evaluation results demonstrate p4tv’s effectiveness and efficiency in temporal verification of stateful P4 programs.

https://www.usenix.org/conference/nsdi25/presentation/zhang-delong

Monday April 28, 2025 11:00am - 11:20am EDT
Independence Ballroom

Track 2

11:20am EDT

NDD: A Decision Diagram for Network Verification

Monday April 28, 2025 11:20am - 11:40am EDT

Independence Ballroom

Zechun Li, Peng Zhang, and Yichi Zhang, Xi'an Jiaotong University; Hongkun Yang, Google

State-of-the-art network verifiers extensively use Binary Decision Diagram (BDD) as the underlying data structure to represent the network state and equivalence classes. Despite its wide usage, we find BDD is not ideal for network verification: verifiers need to handle the low-level computation of equivalence classes, and still face scalability issues when the network state has a lot of bits.

To this end, this paper introduces Network Decision Diagram (NDD), a new decision diagram customized for network verification. In a nutshell, NDD wraps BDD with another layers of decision diagram, such that each NDD node represents a field of the network, and each edge is labeled with a BDD encoding the values of that field. We designed and implemented a library for NDD, which features a native support for equivalence classes, and higher efficiency in memory and computation. Using the NDD library, we re-implemented five BDD-based verifiers with minor modifications to their original codes, and observed a 100× reduction in memory cost and 100× speedup. This indicates that NDD provides a drop-in replacement of BDD for network verifiers.

https://www.usenix.org/conference/nsdi25/presentation/li-zechun

Monday April 28, 2025 11:20am - 11:40am EDT
Independence Ballroom

Track 2

11:40am EDT

Smart Casual Verification of the Confidential Consortium Framework

Monday April 28, 2025 11:40am - 12:00pm EDT

Independence Ballroom

Heidi Howard, Markus A. Kuppe, Edward Ashton, and Amaury Chamayou, Azure Research, Microsoft; Natacha Crooks, Azure Research, Microsoft and UC Berkeley

The Confidential Consortium Framework (CCF) is an open-source platform for developing trustworthy and reliable cloud applications. CCF powers Microsoft's Azure Confidential Ledger service and as such it is vital to build confidence in the correctness of CCF's design and implementation. This paper reports our experiences applying smart casual verification to validate the correctness of CCF's novel distributed protocols, focusing on its unique distributed consensus protocol and its custom client consistency model. We use the term smart casual verification to describe our hybrid approach, which combines the rigor of formal specification and model checking with the pragmatism of automated testing, in our case binding the formal specification in TLA+ to the C++ implementation. While traditional formal methods approaches require substantial buy-in and are often one-off efforts by domain experts, we have integrated our smart casual verification approach into CCF's CI pipeline, allowing contributors to continuously validate CCF as it evolves. We describe the challenges we faced in applying smart casual verification to a complex existing codebase and how we overcame them to find six subtle bugs in the design and implementation before they could impact production.

https://www.usenix.org/conference/nsdi25/presentation/howard

Monday April 28, 2025 11:40am - 12:00pm EDT
Independence Ballroom

Track 2

12:00pm EDT

VEP: A Two-stage Verification Toolchain for Full eBPF Programmability

Monday April 28, 2025 12:00pm - 12:20pm EDT

Independence Ballroom

Xiwei Wu, Yueyang Feng, Tianyi Huang, Xiaoyang Lu, Shengkai Lin, Lihan Xie, Shizhen Zhao, and Qinxiang Cao, Shanghai Jiao Tong University

Extended Berkely Package Filter (eBPF) is a revolutionary technology that can safely and efficiently extend kernel capabilities. It has been widely used in networking, tracing, security, and more. However, existing eBPF verifiers impose strict constraints, often requiring repeated modifications to eBPF programs to pass verification. To enhance programmability, we introduce VEP, an annotation-guided eBPF program verification toolchain. VEP consists of three components: VEP-C, a verifier for annotated eBPF-C programs; VEP-compiler, a compiler targeting annotated eBPF bytecode; and VEP-eBPF, a lightweight bytecode-level proof checker. VEP allows users to verify the correctness of their programs with appropriate annotations, thus enabling full programmability. Our experimental results demonstrate that VEP addresses the limitations of existing verifiers, i.e. the Linux verifier and PREVAIL, and provides a more flexible and automated approach to kernel security.

https://www.usenix.org/conference/nsdi25/presentation/wu-xiwei

Monday April 28, 2025 12:00pm - 12:20pm EDT
Independence Ballroom

Track 2

2:00pm EDT

Pyrrha: Congestion-Root-Based Flow Control to Eliminate Head-of-Line Blocking in Datacenter

Monday April 28, 2025 2:00pm - 2:20pm EDT

Independence Ballroom

Kexin Liu, Zhaochen Zhang, Chang Liu, and Yizhi Wang, Nanjing University; Vamsi Addanki and Stefan Schmid, TU Berlin; Qingyue Wang, Wei Chen, Xiaoliang Wang, and Jiaqi Zheng, Nanjing University; Wenhao Sun, Tao Wu, Ke Meng, Fei Chen, Weiguang Wang, and Bingyang Liu, Huawei, China; Wanchun Dou, Guihai Chen, and Chen Tian, Nanjing University

In modern datacenters, the effectiveness of end-to-end congestion control (CC) is quickly diminishing with the rapid bandwidth evolution. Per-hop flow control (FC) can react to congestion more promptly. However, a coarse-grained FC can result in Head-Of-Line (HOL) blocking. A fine-grained, per-flow FC can eliminate HOL blocking caused by flow control, however, it does not scale well. This paper presents Pyrrha, a scalable flow control approach that provably eliminates HOL blocking while using a minimum number of queues. In Pyrrha, flow control first takes effect on the root of the congestion, i.e., the port where congestion occurs. And then flows are controlled according to their contributed congestion roots. A prototype of Pyrrha is implemented on Tofino2 switches. Compared with state-of-the-art approaches, the average FCT of uncongested flows is reduced by 42%-98%, and 99th-tail latency can be 1.6×-215× lower, without compromising the performance of congested flows.

https://www.usenix.org/conference/nsdi25/presentation/liu-kexin

Monday April 28, 2025 2:00pm - 2:20pm EDT
Independence Ballroom

Track 2

2:20pm EDT

eTran: Extensible Kernel Transport with eBPF

Monday April 28, 2025 2:20pm - 2:40pm EDT

Independence Ballroom

Zhongjie Chen, Tsinghua University; Qingkai Meng, Nanjing University; ChonLam Lao, Harvard University; Yifan Liu and Fengyuan Ren, Tsinghua University; Minlan Yu, Harvard University; Yang Zhou, UC Berkeley and UC Davis

Evolving datacenters with diverse application demands are driving network transport designs. However, few have successfully landed in the widely-used kernel networking stack to benefit broader users, and they take multiple years. We present eTran, a system that makes kernel transport extensible to implement and customize diverse transport designs agilely. To achieve this, eTran leverages and extends eBPF-based techniques to customize the kernel to support complex transport functionalities safely. Meanwhile, eTran carefully absorbs user-space transport techniques for performance gain without sacrificing robust protection. We implement TCP (with DCTCP congestion control) and Homa under eTran, and achieve up to 4.8×/1.8× higher throughput with 3.7×/7.5× lower latency compared to existing kernel implementation.

https://www.usenix.org/conference/nsdi25/presentation/chen-zhongjie

Monday April 28, 2025 2:20pm - 2:40pm EDT
Independence Ballroom

Track 2

2:40pm EDT

White-Boxing RDMA with Packet-Granular Software Control

Monday April 28, 2025 2:40pm - 3:00pm EDT

Independence Ballroom

Chenxingyu Zhao and Jaehong Min, University of Washington; Ming Liu, University of Wisconsin-Madison; Arvind Krishnamurthy, University of Washington

Driven by diverse workloads and deployments, numerous innovations emerge to customize RDMA transport, spanning congestion control, multi-tenant isolation, routing, and more. However, RDMA's hardware-offloading nature poses significant rigidity when landing these innovations. Prior workflows to deliver customizations have either waited for lengthy hardware iterations, developed bespoke hardware, or applied coarse-grained control over the black-box RDMA NIC. Despite considerable efforts, current customization workflows still lack flexibility, raw performance, and broad availability.

In this work, we advocate for White-Boxing RDMA, which provides control of the hardware transport to general-purpose software while preserving raw data path performance. To facilitate the white-boxing methodology, we design and implement Software-Controlled RDMA (SCR), a framework enabling packet-granular software control over the hardware transport. To address challenges stemming from granular control over high-speed line rates, SCR employs effective control models, boosts the efficiency of subsystems within the framework, and leverages emerging hardware capabilities. We implement SCR on the latest Nvidia BlueField-3 equipped with Datapath Accelerators, delivering a spectrum of new customizations not present in legacy RDMA transport, such as Multi-Tenant Fair Scheduler, User-Defined Congestion Control, Receiver-Driven Flow Control, and Multi-Path Routing Selection. Furthermore, we demonstrate SCR's applicability for GPU-Direct and NVMe-oF RDMA with zero modifications to machine learning or storage code.

https://www.usenix.org/conference/nsdi25/presentation/zhao-chenxingyu

Monday April 28, 2025 2:40pm - 3:00pm EDT
Independence Ballroom

Track 2

3:00pm EDT

SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol

Monday April 28, 2025 3:00pm - 3:20pm EDT

Independence Ballroom

Konstantinos Prasopoulos, EPFL; Ryan Kosta, UCSD; Edouard Bugnion, EPFL; Marios Kogias, Imperial College London

Datacenter congestion control protocols are challenged to navigate the throughput-buffering trade-off while relative packet buffer capacity is trending lower year-over-year. In this context, receiver-driven protocols — which schedule packet transmissions instead of reacting to congestion — excel when the bottleneck lies at the ToR-to-receiver link. However, when multiple receivers must use a shared link (e.g., ToR to Spine), their independent schedules can conflict.

We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled, while shared links should be managed with reactive control algorithms. The approach allows receivers to both precisely schedule their downlinks and to coordinate over shared bottlenecks. Critically, SIRD also treats sender uplinks as shared links, enabling the flow of congestion feedback from senders to receivers, which then adapt their scheduling to each sender’s real time capacity. This results in tight scheduling, enabling high bandwidth utilization with little contention, and thus minimal latency-inducing buffering in the fabric.

We implement SIRD on top of the Caladan stack and show that SIRD’s asymmetric design can deliver 100Gbps in software while keeping network queuing minimal. We further compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.

https://www.usenix.org/conference/nsdi25/presentation/prasopoulos

Monday April 28, 2025 3:00pm - 3:20pm EDT
Independence Ballroom

Track 2

3:50pm EDT

Mowgli: Passively Learned Rate Control for Real-Time Video

Monday April 28, 2025 3:50pm - 4:10pm EDT

Independence Ballroom

Neil Agarwal and Rui Pan, Princeton University; Francis Y. Yan, University of Illinois Urbana-Champaign; Ravi Netravali, Princeton University

Rate control algorithms are at the heart of video conferencing platforms, determining target bitrates that match dynamic network characteristics for high quality. Despite the promise that recent data-driven strategies have shown for this challenging task, the performance degradation that they introduce during training has been a nonstarter for many production services, precluding adoption. This paper aims to bolster the practicality of data-driven rate control by presenting an alternate avenue for experiential learning: using purely existing telemetry logs that we surprisingly observe embed performant decisions but often at the wrong times or in the wrong order. To realize this approach despite the inherent uncertainty that log-based learning brings (i.e., lack of feedback for new decisions), our system, Mowgli, combines a variety of robust learning techniques (i.e., conservatively reasoning about alternate behavior to minimize risk and using a richer model formulation to account for environmental noise). Across diverse networks (emulated and real-world), Mowgli outperforms the widely deployed GCC algorithm, increasing average video bitrates by 15–39% while reducing freeze rates by 60–100%.

https://www.usenix.org/conference/nsdi25/presentation/agarwal

Monday April 28, 2025 3:50pm - 4:10pm EDT
Independence Ballroom

Track 2

4:10pm EDT

Dissecting and Streamlining the Interactive Loop of Mobile Cloud Gaming

Monday April 28, 2025 4:10pm - 4:30pm EDT

Independence Ballroom

Yang Li, Jiaxing Qiu, Hongyi Wang, and Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Jing Yang, Tsinghua University; Hao Lin, Tsinghua University and University of Illinois Urbana-Champaign; Yunhao Liu, Tsinghua University; Bo Xiao and Xiaokang Qin, Ant Group; Tianyin Xu, University of Illinois Urbana-Champaign

With cloud-side computing and rendering, mobile cloud gaming (MCG) is expected to deliver high-quality gaming experiences to budget mobile devices. However, our measurement on representative MCG platforms reveals that even under good network conditions, all platforms exhibit high interactive latency of 112–403 ms, from a user-input action to its display response, that critically affects users’ quality of experience. Moreover, jitters in network latency often lead to significant fluctuations in interactive latency.

In this work, we collaborate with a commercial MCG platform to conduct the first in-depth analysis on the interactive latency of cloud gaming. We identify VSync, the synchronization primitive of Android graphics pipeline, to be a key contributor to the excessive interactive latency; as many as five VSync events are intricately invoked, which serialize the complex graphics processing logic on both the client and cloud sides. To address this, we design an end-to-end VSync regulator, dubbed LoopTailor, which minimizes VSync events by decoupling game rendering from the lengthy cloud-side graphics pipeline and coordinating cloud game rendering directly with the client. We implement LoopTailor on the collaborated platform and commodity Android devices, reducing the interactive latency (by ∼34%) to stably below 100 ms.

https://www.usenix.org/conference/nsdi25/presentation/li-yang

Monday April 28, 2025 4:10pm - 4:30pm EDT
Independence Ballroom

Track 2

4:30pm EDT

Region-based Content Enhancement for Efﬁcient Video Analytics at the Edge

Monday April 28, 2025 4:30pm - 4:50pm EDT

Independence Ballroom

Weijun Wang, Institute for AI Industry Research (AIR), Tsinghua University; Liang Mi, Shaowei Cen, and Haipeng Dai, State Key Laboratory for Novel Software Technology, Nanjing University; Yuanchun Li, Institute for AI Industry Research (AIR), Tsinghua University; Xiaoming Fu, University of Göttingen; Yunxin Liu, Institute for AI Industry Research (AIR), Tsinghua University

Video analytics is widespread in various applications serving our society. Recent advances of content enhancement in video analytics offer signiﬁcant beneﬁts for the bandwidth saving and accuracy improvement. However, existing content-enhanced video analytics systems are excessively computationally expensive and provide extremely low throughput. In this paper, we present region-based content enhancement, that enhances only the important regions in videos, to improve analytical accuracy. Our system, RegenHance, enables high-accuracy and high-throughput video analytics at the edge by 1) a macroblock-based region importance predictor that identifies the important regions fast and precisely, 2) a regionaware enhancer that stitches sparsely distributed regions into dense tensors and enhances them efﬁciently, and 3) a proﬁle-based execution planer that allocates appropriate resources for enhancement and analytics components. We prototype RegenHance on ﬁve heterogeneous edge devices. Experiments on two analytical tasks reveal that region-based enhancement improves the overall accuracy of 10-19% and achieves 2-3× throughput compared to the state-of-the-art frame-based enhancement methods.

https://www.usenix.org/conference/nsdi25/presentation/wang-weijun

Monday April 28, 2025 4:30pm - 4:50pm EDT
Independence Ballroom

Track 2

4:50pm EDT

Tooth: Toward Optimal Balance of Video QoE and Redundancy Cost by Fine-Grained FEC in Cloud Gaming Streaming

Monday April 28, 2025 4:50pm - 5:10pm EDT

Independence Ballroom

Congkai An, Huanhuan Zhang, Shibo Wang, Jingyang Kang, Anfu Zhou, Liang Liu, and Huadong Ma, Beijing University of Posts and Telecommunications; Zili Meng, Hong Kong University of Science and Technology; Delei Ma, Yusheng Dong, and Xiaogang Lei, Well-Link Times Inc.

Despite the rapid rise of cloud gaming, real-world evaluations of its quality of experience (QoE) remain scarce. To fill this gap, we conduct a large-scale measurement campaign, analyzing over 60,000 sessions on an operational cloud gaming platform. We find that current cloud gaming streaming suffers from substantial bandwidth wastage and severe interaction stalls simultaneously. In-depth investigation reveals the underlying reason, i.e., existing streaming adopts coarse-grained Forward Error Correction (FEC) encoding, without considering the adverse impact of frame length variation, which results in over-protection of large frames (i.e., bandwidth waste) and under-protection of smaller ones (i.e., interaction stalls). To remedy the problem, we propose Tooth, a per-frame adaptive FEC that aims to achieve the optimal balance between satisfactory QoE and efficient bandwidth usage. To build Tooth, we design a dual-module FEC encoding strategy, which takes full consideration of both frame length variation and network dynamics, and hence determines an appropriate FEC redundancy rate for each frame. Moreover, we also circumvent the formidable per-frame FEC computational overhead by designing a lightweight Tooth, so as to meet the rigid latency bound of real-time cloud gaming. We implement, deploy, and evaluate Tooth in the operational cloud gaming system. Extensive field tests demonstrate that Tooth significantly outperforms existing state-of-the-art FEC methods, reducing stall rates by 40.2% to 85.2%, enhancing video bitrates by 11.4% to 29.2%, and lowering bandwidth costs by 54.9% to 75.0%.

https://www.usenix.org/conference/nsdi25/presentation/an

Monday April 28, 2025 4:50pm - 5:10pm EDT
Independence Ballroom

Track 2

5:10pm EDT

AsTree: An Audio Subscription Architecture Enabling Massive-Scale Multi-Party Conferencing

Monday April 28, 2025 5:10pm - 5:30pm EDT

Independence Ballroom

Tong Meng, Wenfeng Li, Chao Yuan, Changqing Yan, and Le Zhang, ByteDance Inc.

While operating a multi-party video conferencing system (Lark) globally, we find that audio subscription alone may pose considerable challenges to the network, especially when scaling towards massive scales. Traditional strategy of subscribing to all remote participants suffers from issues such as signaling storm, excessive bandwidth and resource consumption on both server and client sides. Aimed at enhanced scalability, we share our design of AsTree, an audio subscription architecture. By a cascading tree topology and media plane-based audio selection, AsTree dramatically reduces the number of signaling messages and audio streams to forward. Practical deployment in Lark reduces audio and video stall ratios by more than 30% and 50%. We also receive 40% less negative client reviews, strongly proving the value of AsTree.

https://www.usenix.org/conference/nsdi25/presentation/meng

Monday April 28, 2025 5:10pm - 5:30pm EDT
Independence Ballroom

Track 2

9:00am EDT

Pineapple: Unifying Multi-Paxos and Atomic Shared Registers

Tuesday April 29, 2025 9:00am - 9:20am EDT

Independence Ballroom

Tigran Bantikyan, Northwestern; Jonathan Zarnstorff, unaffiliated; Te-Yen Chou, CMU; Lewis Tseng, UMass Lowell; Roberto Palmieri, Lehigh University

Linearizable storage systems reduce the complexity of developing correct large-scale customer-facing applications, in the presence of concurrent operations and failures. A common approach for providing linearizability is to use consensus to order operations invoked by applications. This paper explores designs that offload operations (from the consensus component) to improve overall performance.

This paper presents Pineapple, which uses logical timestamps to unify Multi-Paxos and atomic shared registers so that any node in the system can serve read and write operations. Compared to Multi-Paxos (or leader-based consensus), Pineapple reduces bottlenecks at the leader. Compared to Gryff, which unifies EPaxos and atomic shared registers, Pineapple has better performance because Pineapple has “non-blocking operation execution.”

Our evaluation shows that Pineapple improves both throughput and tail latency, compared to state-of-the-art systems (e.g., Gryff, Multi-Paxos, EPaxos), in both wide-area networks and local-area networks. We also integrate Pineapple with etcd. In a balanced workload, Pineapple reduces median latency by more than 50%, compared to the original system that uses an optimized version of Raft.

https://www.usenix.org/conference/nsdi25/presentation/bantikyan

Tuesday April 29, 2025 9:00am - 9:20am EDT
Independence Ballroom

Track 2

9:20am EDT

Ladder: A Convergence-based Structured DAG Blockchain for High Throughput and Low Latency

Tuesday April 29, 2025 9:20am - 9:40am EDT

Independence Ballroom

Dengcheng Hu, Jianrong Wang, Xiulong Liu, and Hao Xu, Tianjin University; Xujing Wu, Jd.Com, Inc; Muhammad Shahzad, North Carolina State University; Guyue Liu, Peking University; Keqiu Li, Tianjin University

Recent literature proposes the use of Directed Acyclic Graphs (DAG) to enhance blockchain performance. However, current block-DAG designs face three important limitations when fully utilizing parallel block processing: high computational overhead due to costly block sorting, complex transaction confirmation process, and vulnerability to balance attacks when determining the pivot chain. To this end, we propose Ladder, a structured twin-chain DAG blockchain with a convergence mechanism that efficiently optimizes parallel block processing strategy and enhances overall performance and security. In each round, a designated convergence node generates a lower-chain block, sorting the forked blocks from the upper-chain, reducing computational overhead and simplifying transaction confirmation.To counter potential adversarial disruptions, a dynamic committee is selected to generate special blocks when faulty blocks are detected. We implemented and evaluated Ladder in a distributed network environment against several state-of-the-art methods. Our results show that Ladder achieves a 59.6% increase in throughput and a 20.9% reduction in latency.

https://www.usenix.org/conference/nsdi25/presentation/hu

Tuesday April 29, 2025 9:20am - 9:40am EDT
Independence Ballroom

Track 2

9:40am EDT

Vegeta: Enabling Parallel Smart Contract Execution in Leaderless Blockchains

Tuesday April 29, 2025 9:40am - 10:00am EDT

Independence Ballroom

Tianjing Xu and Yongqi Zhong, Shanghai Jiao Tong University; Yiming Zhang, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3; Ruofan Xiong, Xiamen University; Jingjing Zhang, Fudan University; Guangtao Xue and Shengyun Liu, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3

Consensus and smart contract execution play complementary roles in blockchain systems. Leaderless consensus, as a promising direction in the blockchain context, can better utilize the resources of each node and/or avoid incurring the extra burden of timing assumptions. As modern Byzantine-Fault Tolerant (BFT) consensus protocols can order several hundred thousand transactions per second, contract execution is becoming the performance bottleneck. Adding concurrency to contract execution is a natural way to boost its performance, but none of the existing frameworks is a perfect fit for leaderless consensus.

We propose speculate-order-replay, a generic framework tailored to leaderless consensus protocols. Our framework allows each proposer to (pre-)process transactions prior to consensus, better utilizing its computing resources. We instantiate the framework with a concrete concurrency control protocol Vegeta. Vegeta speculatively executes a series of transactions and analyzes their dependencies before consensus, and later deterministically replays the schedule. We ran experiments under the real-world Ethereum workload on 16vCPU virtual machines. Our evaluation results show that Vegeta achieved up to 7.8× speedup compared to serial execution. When deployed on top of a leaderless consensus protocol with 10 nodes, Vegeta still achieved 6.9× speedup.

https://www.usenix.org/conference/nsdi25/presentation/xu-tianjing

Tuesday April 29, 2025 9:40am - 10:00am EDT
Independence Ballroom

Track 2

10:00am EDT

Shoal++: High Throughput DAG BFT Can Be Fast and Robust!

Tuesday April 29, 2025 10:00am - 10:20am EDT

Independence Ballroom

Balaji Arun and Zekun Li, Aptos Labs; Florian Suri-Payer, Cornell University; Sourav Das, UIUC; Alexander Spiegelman, Aptos Labs

Today's practical partially synchronous Byzantine Fault Tolerant consensus protocols trade off low latency and high throughput. On the one end, traditional BFT protocols such as PBFT and its derivatives optimize for latency. They require, in fault-free executions, only 3 message delays to commit, the optimum for BFT consensus. However, this class of protocols typically relies on a single leader, hampering throughput scalability. On the other end, a new class of so-called DAG-BFT protocols demonstrates how to achieve highly scalable throughput by separating data dissemination from consensus, and using every replica as proposer. Unfortunately, existing DAG-BFT protocols pay a steep latency premium, requiring on average 10.5 message delays to commit transactions.

This work aims to soften this tension, and proposes Shoal++, a novel DAG-based BFT consensus system that offers the throughput of DAGs while reducing end-to-end consensus commit latency to an average of 4.5 message delays. Our empirical findings are encouraging, showing that Shoal++ achieves throughput comparable to state-of-the-art DAG BFT solutions while reducing latency by up to 60%, even under less favorable network and failure conditions.

https://www.usenix.org/conference/nsdi25/presentation/arun

Tuesday April 29, 2025 10:00am - 10:20am EDT
Independence Ballroom

Track 2

10:50am EDT

HA/TCP: A Reliable and Scalable Framework for TCP Network Functions

Tuesday April 29, 2025 10:50am - 11:10am EDT

Independence Ballroom

Haoyu Gu, Ali José Mashtizadeh, and Bernard Wong, University of Waterloo

Layer 7 network functions (NFs) are a critical piece of modern network infrastructure. As a result, the scalability and reliability of these NFs are important but challenging because of the complexity of layer 7 NFs. This paper presents HA/TCP, a framework that enables migration and failover of layer 7 NFs. HA/TCP uses a novel replication mechanism to synchronize the state between replicas with low overhead, enabling seamless migration and failover of TCP connections. HA/TCP encapsulates the implementation details into our replicated socket interface to allow developers to easily add high availability to their layer 7 NFs such as WAN accelerators, load balancers, and proxies. Our benchmarks show that HA/TCP provides reliability for a 100 Gbps NF with as little as 0.2% decrease in client throughput. HA/TCP transparently migrates a connection between replicas in 38 µs, including the network latency. We provide reliability to a SOCKS proxy and a WAN accelerator with less than 2% decrease in throughput and a modest increase in CPU usage.

https://www.usenix.org/conference/nsdi25/presentation/gu

Tuesday April 29, 2025 10:50am - 11:10am EDT
Independence Ballroom

Track 2

11:10am EDT

High-level Programming for Application Networks

Tuesday April 29, 2025 11:10am - 11:30am EDT

Independence Ballroom

Xiangfeng Zhu, Yuyao Wang, Banruo Liu, Yongtong Wu, and Nikola Bojanic, University of Washington; Jingrong Chen, Duke University; Gilbert Louis Bernstein and Arvind Krishnamurthy, University of Washington; Sam Kumar, University of Washington and UCLA; Ratul Mahajan, University of Washington; Danyang Zhuo, Duke University

Application networks facilitate communication between the microservices of cloud applications. They are built today using service meshes with low-level specifications that make it difficult to express application-specific functionality (e.g., access control based on RPC fields), and they can more than double the RPC latency. We develop AppNet, a framework that makes it easy to build expressive and high-performance application networks. Developers specify rich RPC processing in a high-level language with generalized match-action rules and built-in state management. We compile the specifications to high-performance code after optimizing where (e.g., client, server) and how (e.g., RPC library, proxy) each RPC processing element runs. The optimization uses symbolic abstraction and execution to judge if different runtime configurations of possibly-stateful RPC processing elements are semantically equivalent for arbitrary RPC streams. Our experiments show that AppNet can express common application network function in only 7-28 lines of code. Its optimizations lower RPC processing latency by up to 82%.

https://www.usenix.org/conference/nsdi25/presentation/zhu

Tuesday April 29, 2025 11:10am - 11:30am EDT
Independence Ballroom

Track 2

11:30am EDT

State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing

Tuesday April 29, 2025 11:30am - 11:50am EDT

Independence Ballroom

Qiongwen Xu, Rutgers University; Sebastiano Miano, Politecnico di Milano; Xiangyu Gao and Tao Wang, New York University; Adithya Murugadass and Songyuan Zhang, Rutgers University; Anirudh Sivaraman, New York University; Gianni Antichi, Queen Mary University of London and Politecnico di Milano; Srinivas Narayana, Rutgers University

With the slowdown of Moore’s law, CPU-oriented packet processing in software will be significantly outpaced by emerging line speeds of network interface cards (NICs). Single-core packet-processing throughput has saturated.

We consider the problem of high-speed packet processing with multiple CPU cores. The key challenge is state—memory that multiple packets must read and update. The prevailing method to scale throughput with multiple cores involves state sharding, processing all packets that update the same state, e.g., flow, at the same core. However, given the skewed nature of realistic flow size distributions, this method is untenable, since total throughput is limited by single-core performance.

This paper introduces state-compute replication, a principle to scale the throughput of a single stateful flow across multiple cores using replication. Our design leverages a packet history sequencer running on a NIC or top-of-the-rack switch to enable multiple cores to update state without explicit synchronization. Our experiments with realistic data center and wide-area Internet traces show that state-compute replication can scale total packet-processing throughput linearly with cores, independent of flow size distributions, across a range of realistic packet-processing programs.

https://www.usenix.org/conference/nsdi25/presentation/xu-qiongwen

Tuesday April 29, 2025 11:30am - 11:50am EDT
Independence Ballroom

Track 2

11:50am EDT

MTP: Transport for In-Network Computing

Tuesday April 29, 2025 11:50am - 12:10pm EDT

Independence Ballroom

Tao Ji, UT Austin; Rohan Vardekar and Balajee Vamanan, University of Illinois Chicago; Brent E. Stephens, Google and University of Utah; Aditya Akella, UT Austin

In-network computing (INC) is being increasingly adopted to accelerate applications by offloading part of the applications’ computation to network devices. Such application-specific (L7) offloads have several attributes that the transport protocol must work with — they may mutate, intercept, reorder and
delay application messages that span multiple packets. At the same time the transport must also work with the buffering and computation constraints of network devices hosting the L7 offloads. Existing transports and alternative approaches fall short in these regards. Therefore, we present MTP, the first transport to natively support INC. MTP is built around two major components: 1) a novel message-oriented reliability protocol and 2) a resource-specific congestion control framework. We implement a full-fledged prototype of MTP based on DPDK. We show the efficacy of MTP in a testbed with a real INC application as well as with comprehensive microbenchmarks and large-scale simulations.

https://www.usenix.org/conference/nsdi25/presentation/ji

Tuesday April 29, 2025 11:50am - 12:10pm EDT
Independence Ballroom

Track 2

2:00pm EDT

Mitigating Scalability Walls of RDMA-based Container Networks

Tuesday April 29, 2025 2:00pm - 2:20pm EDT

Independence Ballroom

Wei Liu, Tsinghua University and Alibaba Cloud; Kun Qian, Alibaba Cloud; Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Tianyin Xu, UIUC; Yunhao Liu, Tsinghua University; Yu Guan, Shuhong Zhu, Hongfei Xu, Lanlan Xi, Chao Qin, and Ennan Zhai, Alibaba Cloud

As a state-of-the-art technique, RDMA-offloaded container networks (RCNs) can provide high-performance data communications among containers. Nevertheless, this seems to be subject to the RCN scale—when there are millions of containers simultaneously running in a data center, the performance decreases sharply and unexpectedly. In particular, we observe that most performance issues are related to RDMA NICs (RNICs), whose design and implementation defects might constitute the "scalability wall" of the RCN. To validate the conjecture, however, we are challenged by the limited visibility into the internals of today's RNICs. To address the dilemma, a more pragmatic approach is to infer the most likely causes of the performance issues according to the common abstractions of an RNIC's components and functionalities.

Specifically, we conduct combinatorial causal testing to efficiently reason about an RNIC's architecture model, effectively approximate its performance model, and thereby proactively optimize the NF (network function) offloading schedule. We embody the design into a practical system dubbed ScalaCN. Evaluation on production workloads shows that the end-to-end network bandwidth increases by 1.4× and the packet forwarding latency decreases by 31%, after resolving 82% of the causes inferred by ScalaCN. We report the performance issues of RNICs and the most likely causes to relevant vendors, all of which have been encouragingly confirmed; we are now closely working with the vendors to fix them.

https://www.usenix.org/conference/nsdi25/presentation/liu-wei

Tuesday April 29, 2025 2:00pm - 2:20pm EDT
Independence Ballroom

Track 2

2:20pm EDT

Eden: Developer-Friendly Application-Integrated Far Memory

Tuesday April 29, 2025 2:20pm - 2:40pm EDT

Independence Ballroom

Anil Yelam, Stewart Grant, and Saarth Deshpande, UC San Diego; Nadav Amit, Technion, Israel Institute of Technology; Radhika Niranjan Mysore, VMware Research Group; Amy Ousterhout, UC San Diego; Marcos K. Aguilera, VMware Research Group; Alex C. Snoeren, UC San Diego

Far memory systems are a promising way to address the resource stranding problem in datacenters. Far memory systems broadly fall into two categories. On one hand, paging-based systems use hardware guards at the granularity of pages to intercept remote accesses, which require no application changes but incur significant faulting overhead. On the other hand, app-integrated systems use software guards on data objects and apply application-specific optimizations to avoid faulting overheads, but these systems require significant application redesign and/or suffer from overhead on local accesses. We propose Eden, a new approach to far memory that combines hardware guards with a small number of software guards in the form of programmer annotations, to achieve performance similar to app-integrated systems with minimal developer effort. Eden is based on the insight that applications generate most of their page faults at a small number of code locations, and those locations are easy to find programmatically. By adding hints to such locations, Eden can notify the pager about upcoming memory accesses to customize read-ahead and memory reclamation. We show that Eden achieves 19.4–178% higher performance than Fastswap for memory-intensive applications including DataFrame and memcached. Eden achieves performance comparable to AIFM with almost 100× fewer code changes.

https://www.usenix.org/conference/nsdi25/presentation/yelam

Tuesday April 29, 2025 2:20pm - 2:40pm EDT
Independence Ballroom

Track 2

2:40pm EDT

Achieving Wire-Latency Storage Systems by Exploiting Hardware ACKs

Tuesday April 29, 2025 2:40pm - 3:00pm EDT

Independence Ballroom

Qing Wang, Jiwu Shu, Jing Wang, and Yuhao Zhang, Tsinghua University

We present Juneberry, a low-latency communication framework for storage systems. Different from existing RPC frameworks, Juneberry provides a fast path for storage requests: they can be committed with a single round trip and server CPU bypass, thus delivering extremely low latency; the execution of these requests is performed asynchronously on the server CPU. Juneberry achieves it by relying on our proposed Ordered Queue abstraction, which exploits NICs’ hardware ACKs as commit signals of requests while ensuring linearizability of the whole system. Juneberry also supports durability by placing requests in persistent memory (PM). We implement Juneberry using commodity RDMA NICs and integrate it into two storage systems: Memcached (a widely used in-memory caching system) and PMemKV (a PM-based persistent key-value store). Evaluation shows that compared with RPC, Juneberry can significantly lower their latency under write-intensive workloads.

https://www.usenix.org/conference/nsdi25/presentation/wang-qing

Tuesday April 29, 2025 2:40pm - 3:00pm EDT
Independence Ballroom

Track 2

3:00pm EDT

ODRP: On-Demand Remote Paging with Programmable RDMA

Tuesday April 29, 2025 3:00pm - 3:20pm EDT

Independence Ballroom

Zixuan Wang, Xingda Wei, Jinyu Gu, Hongrui Xie, Rong Chen, and Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University

Memory disaggregation with OS swapping is becoming popular for next-generation datacenters. RDMA is a promising technique for achieving this. However, RDMA does not support dynamic memory management in the data path. Current systems rely on RDMA’s control path operations, which are designed for coarse-grained memory management. This results in a trade-off between performance and memory utilization and also requires significant CPU usage, which is a limited resource on memory nodes.

This paper introduces On-Demand Remote Paging, the first system that smartly chains native RDMA data path primitives to offload all memory access and management operations onto the RDMA-capable NIC (RNIC). However, efficiently implementing these operations is challenging due to the limited capability of RNIC. ODRP leverages the semantics of OS swapping and adopts a client-assisted principle to address the efficiency and functionality challenges. Compared to the state-of-the-art system, ODRP can achieve significantly better memory utilization, no CPU usage while introducing only a 0.8% to 14.6% performance overhead in real-world applications.

https://www.usenix.org/conference/nsdi25/presentation/wang-zixuan

Tuesday April 29, 2025 3:00pm - 3:20pm EDT
Independence Ballroom

Track 2

3:50pm EDT

CellReplay: Towards accurate record-and-replay for cellular networks

Tuesday April 29, 2025 3:50pm - 4:10pm EDT

Independence Ballroom

William Sentosa, University of Illinois Urbana-Champaign; Balakrishnan Chandrasekaran, VU Amsterdam; P. Brighten Godfrey, University of Illinois Urbana-Champaign and Broadcom; Haitham Hassanieh, EPFL

The inherent variability of real-world cellular networks makes it hard to evaluate, reproduce, and debug the performance of networked applications running on these networks. A common approach is to record and replay a trace of observed cellular network performance. However, we show that the state-of-the-art record-and-replay technique produces empirically inaccurate results that can cause evaluation bias. This paper presents the design and implementation of CellReplay, a tool that records the time-varying performance of a live cellular network into traces using preset workloads and faithfully replays the observed performance for other workloads through an emulated network interface. The key challenge in achieving high accuracy is to replay varying network behavior in a way that captures its sensitivity to the workload. CellReplay records network behavior under two predefined workloads simultaneously and interpolates upon replay for other workloads. Across various challenging network conditions, our evaluation shows that real-world networked applications (e.g., web browsing or video streaming) running on CellReplay achieve similar performance (e.g., page load time or bitrate selection) to their live network counterparts, with significantly reduced error compared to the prior method.

https://www.usenix.org/conference/nsdi25/presentation/sentosa

Tuesday April 29, 2025 3:50pm - 4:10pm EDT
Independence Ballroom

Track 2

4:10pm EDT

Large Network UWB Localization: Algorithms and Implementation

Tuesday April 29, 2025 4:10pm - 4:30pm EDT

Independence Ballroom

Nakul Garg and Irtaza Shahid, University of Maryland, College Park; Ramanujan K Sheshadri, Nokia Bell Labs; Karthikeyan Sundaresan, Georgia Institute of Technology; Nirupam Roy, University of Maryland, College Park

Localization of networked nodes is an essential problem in emerging applications, including first-responder navigation, automated manufacturing lines, vehicular and drone navigation, asset tracking, Internet of Things, and 5G communication networks. In this paper, we present Locate3D, a novel system for peer-to-peer node localization and orientation estimation in large networks. Unlike traditional range-only methods, Locate3D introduces angle-of-arrival (AoA) data as an added network topology constraint. The system solves three key challenges: it uses angles to reduce the number of measurements required by 4× and jointly uses range and angle data for location estimation. We develop a spanning-tree approach for fast location updates, and to ensure the output graphs are rigid and uniquely realizable, even in occluded or weakly connected areas. Locate3D cuts down latency by up to 75% without compromising accuracy, surpassing standard range-only solutions. It has a 0.86 meter median localization error for building-scale multi-floor networks (32 nodes, 0 anchors) and 12.09 meters for large-scale networks (100,000 nodes, 15 anchors).

https://www.usenix.org/conference/nsdi25/presentation/garg

Tuesday April 29, 2025 4:10pm - 4:30pm EDT
Independence Ballroom

Track 2

4:30pm EDT

Towards Energy Efficient 5G vRAN Servers

Tuesday April 29, 2025 4:30pm - 4:50pm EDT

Independence Ballroom

Anuj Kalia, Microsoft; Nikita Lazarev, MIT; Leyang Xue, The University of Edinburgh; Xenofon Foukas and Bozidar Radunovic, Microsoft; Francis Y. Yan, Microsoft Research and UIUC

We study the problem of improving energy efficiency in virtualized radio access network (vRAN) servers, focusing on CPUs. Two distinct characteristics of vRAN software—strict real-time sub-millisecond deadlines and its proprietary black-box nature—preclude the use of existing general-purpose CPU energy management techniques. This paper presents RENC, a system that saves energy by adjusting CPU frequency in response to sub-second variations in cellular workloads, using the following techniques. First, despite large fluctuations in vRAN CPU load at sub-ms timescales, RENC establishes safe low-load intervals, e.g., by coupling Media Access Control (MAC) layer rate limiting with CPU frequency changes. This prevents high traffic during low-power operation, which would otherwise cause deadline misses. Second, we design techniques to compute CPU frequencies that are safe for these low-load intervals, achieved by measuring the slack in vRAN threads' deadlines using Linux eBPF hooks, or minor binary rewriting of the vRAN software. Third, we demonstrate the need to handle CPU load spikes triggered by control operations, such as new users attaching to the network. Our evaluation in a state-of-the-art vRAN testbed shows that our techniques reduces a vRAN server's CPU power consumption by up to 45% (29% server-wide).

https://www.usenix.org/conference/nsdi25/presentation/kalia

Tuesday April 29, 2025 4:30pm - 4:50pm EDT
Independence Ballroom

Track 2

4:50pm EDT

Building Massive MIMO Baseband Processing on a Single-Node Supercomputer

Tuesday April 29, 2025 4:50pm - 5:10pm EDT

Independence Ballroom

Xincheng Xie, Wentao Hou, Zerui Guo, and Ming Liu, University of Wisconsin-Madison

The rising deployment of massive MIMO coupled with the wide adoption of virtualized radio access networks (vRAN) poses an unprecedented computational demand on the baseband processing, hardly met by existing vRAN hardware substrates. The single-node supercomputer, an emerging computing platform, offers scalable computation and communication capabilities, making it a promising target to hold and run the baseband pipeline. However, realizing this is non-trivial due to the mismatch between (a) the diverse execution granularities and incongruent parallel degrees of different stages along the software processing pipeline and (b) the underlying evolving irregular hardware parallelism at runtime.

This paper closes the gap by designing and implementing MegaStation–an application-platform co-designed system that effectively harnesses the computing power of a single-node supercomputer for processing massive MIMO baseband. Our key insight is that one can adjust the execution granularity and reconstruct the baseband processing pipeline on the fly based on the monitored hardware parallelism status. Inspired by dynamic instruction scheduling, MegaStation models the single-node supercomputer as a tightly coupled microprocessor and employs a scoreboarding-like algorithm to orchestrate "baseband processing" instructions over GPU-instantiated executors. Our evaluations using the GigaIO FabreX demonstrate that MegaStation achieves up to 66.2% lower tail frame processing latency and 4× higher throughput than state-of-the-art solutions. MegaStation is a scalable and adaptive solution that can meet today’s vRAN requirements.

https://www.usenix.org/conference/nsdi25/presentation/xie

Tuesday April 29, 2025 4:50pm - 5:10pm EDT
Independence Ballroom

Track 2

5:10pm EDT

Efficient Multi-WAN Transport for 5G with OTTER

Tuesday April 29, 2025 5:10pm - 5:30pm EDT

Independence Ballroom

Mary Hogan, Oberlin College; Gerry Wan, Google; Yiming Qiu, University of Michigan; Sharad Agarwal and Ryan Beckett, Microsoft; Rachee Singh, Cornell University; Paramvir Bahl, Microsoft

In the ongoing cloudiﬁcation of 5G, software network functions (NFs) are replacing ﬁxed-function network hardware, allowing 5G network operators to leverage the beneﬁts of cloud computing. The migration of NFs and their management to the cloud causes 5G trafﬁc to traverse an operator’s wide-area network (WAN) to the cloud WAN that hosts the datacenters (DCs) running 5G NFs and applications. However, achieving end-to-end (E2E) performance for 5G trafﬁc across two WANs is hard. Placing 5G ﬂows across two WANs with different performance and reliability characteristics, edge and DC resource constraints, and interference from other ﬂows is different and more challenging than single-WAN trafﬁc engineering. We address this challenge and show that orchestrating E2E paths across a multi-WAN overlay allows us to achieve average 13% more throughput, 15% less RTT, 45% less jitter, or reduce average loss from 0.06% to under 0.001%. We implement our multi-WAN 5G ﬂow placement in a scalable optimization prototype that allocates 26%–45% more bytes on the network than greedy baselines while also satisfying the service demands of more ﬂows.

https://www.usenix.org/conference/nsdi25/presentation/hogan

Tuesday April 29, 2025 5:10pm - 5:30pm EDT
Independence Ballroom

Track 2