Loading…
arrow_back View All Dates
Tuesday, April 29
 

8:00am EDT

Continental Breakfast
Tuesday April 29, 2025 8:00am - 9:00am EDT
Tuesday April 29, 2025 8:00am - 9:00am EDT
Liberty Ballroom Foyer

8:00am EDT

Badge Pickup
Tuesday April 29, 2025 8:00am - 5:00pm EDT
Tuesday April 29, 2025 8:00am - 5:00pm EDT
Liberty Ballroom Foyer

9:00am EDT

AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training
Tuesday April 29, 2025 9:00am - 9:20am EDT
Guanbin Xu, Zhihao Le, Yinhe Chen, Zhiqi Lin, and Zewen Jin, University of Science and Technology of China; Youshan Miao, Microsoft Research; Cheng Li, University of Science and Technology of China; Anhui Province Key Laboratory of Biomedical Imaging and Intelligent Processing; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center


The collective communication libraries are pivotal in optimizing the performance of distributed and parallel deep neural network (DNN) training. Most network optimizations are under the assumption that these libraries are well-tuned, ignoring their low-level parameter selection. In this paper, we present a novel automated tuning method AutoCCL that significantly improves communication performance without incurring additional costs. One of the primary challenges we tackle is the state explosion in searching for the optimal configuration. To overcome this, we decouple implementation-related parameters from those sensitive to the search space size and propose a divide-and-conquer algorithm, minimizing the requirement for exhaustive trials. We further propose an online tuning approach that accounts for communication-computation interference to enhance accuracy in finding optimal configurations, while hiding tuning overhead within early iterations of training jobs. We implement AutoCCL atop NCCL, a leading and widely-used communication library provided by NVIDIA. Our evaluation on both a 2-node cluster (16 A40 GPUs, intra-node NVLink, inter-node 2× 400Gbps InfiniBand) and a 4-node cluster (32 A40 GPUs, intra-node PCIe, inter-node 100Gbps InfiniBand) demonstrates that AutoCCL achieves 1.24-1.29× and 1.15-1.22× speedups on microbenchmarks compared to NCCL and another SOTA NCCL tuner, respectively, and up to 1.80× and 1.49× with concurrent computation. End-to-end evaluations on three large language models and one vision model show 1.07-1.32× improvements in periteration training time.


https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin
Tuesday April 29, 2025 9:00am - 9:20am EDT
Liberty Ballroom

9:00am EDT

Pineapple: Unifying Multi-Paxos and Atomic Shared Registers
Tuesday April 29, 2025 9:00am - 9:20am EDT
Tigran Bantikyan, Northwestern; Jonathan Zarnstorff, unaffiliated; Te-Yen Chou, CMU; Lewis Tseng, UMass Lowell; Roberto Palmieri, Lehigh University


Linearizable storage systems reduce the complexity of developing correct large-scale customer-facing applications, in the presence of concurrent operations and failures. A common approach for providing linearizability is to use consensus to order operations invoked by applications. This paper explores designs that offload operations (from the consensus component) to improve overall performance.

This paper presents Pineapple, which uses logical timestamps to unify Multi-Paxos and atomic shared registers so that any node in the system can serve read and write operations. Compared to Multi-Paxos (or leader-based consensus), Pineapple reduces bottlenecks at the leader. Compared to Gryff, which unifies EPaxos and atomic shared registers, Pineapple has better performance because Pineapple has “non-blocking operation execution.”

Our evaluation shows that Pineapple improves both throughput and tail latency, compared to state-of-the-art systems (e.g., Gryff, Multi-Paxos, EPaxos), in both wide-area networks and local-area networks. We also integrate Pineapple with etcd. In a balanced workload, Pineapple reduces median latency by more than 50%, compared to the original system that uses an optimized version of Raft.


https://www.usenix.org/conference/nsdi25/presentation/bantikyan
Tuesday April 29, 2025 9:00am - 9:20am EDT
Independence Ballroom

9:20am EDT

OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud
Tuesday April 29, 2025 9:20am - 9:40am EDT
Ertza Warraich, Purdue University; Omer Shabtai and Khalid Manaa, Nvidia; Shay Vargaftik, VMware Research; Yonatan Piasetzky and Matty Kadosh, Nvidia; Lalith Suresh, Feldera; Muhammad Shahbaz, University of Michigan


We present OptiReduce, a new collective-communication system for the cloud with bounded, predictable completion times for deep-learning jobs in the presence of varying computation (stragglers) and communication (congestion and gradient drops) variabilities. OptiReduce exploits the inherent resiliency and the stochastic nature of distributed deep-learning (DDL) training and fine-tuning to work with approximated (or lost) gradients—providing an efficient balance between (tail) performance and the resulting accuracy of the trained models.

Exploiting this domain-specific characteristic of DDL, OptiReduce introduces (1) mechanisms (e.g., unreliable bounded transport with adaptive timeout) to improve the DDL jobs’ tail execution time, and (2) strategies (e.g., Transpose AllReduce and Hadamard Transform) to mitigate the impact of gradient drops on model accuracy. Our evaluation shows that OptiReduce achieves 70% and 30% faster time-to-accuracy (TTA), on average, when operating in shared, cloud environments (e.g., CloudLab) compared to Gloo and NCCL, respectively.


https://www.usenix.org/conference/nsdi25/presentation/warraich
Tuesday April 29, 2025 9:20am - 9:40am EDT
Liberty Ballroom

9:20am EDT

Ladder: A Convergence-based Structured DAG Blockchain for High Throughput and Low Latency
Tuesday April 29, 2025 9:20am - 9:40am EDT
Dengcheng Hu, Jianrong Wang, Xiulong Liu, and Hao Xu, Tianjin University; Xujing Wu, Jd.Com, Inc; Muhammad Shahzad, North Carolina State University; Guyue Liu, Peking University; Keqiu Li, Tianjin University


Recent literature proposes the use of Directed Acyclic Graphs (DAG) to enhance blockchain performance. However, current block-DAG designs face three important limitations when fully utilizing parallel block processing: high computational overhead due to costly block sorting, complex transaction confirmation process, and vulnerability to balance attacks when determining the pivot chain. To this end, we propose Ladder, a structured twin-chain DAG blockchain with a convergence mechanism that efficiently optimizes parallel block processing strategy and enhances overall performance and security. In each round, a designated convergence node generates a lower-chain block, sorting the forked blocks from the upper-chain, reducing computational overhead and simplifying transaction confirmation.To counter potential adversarial disruptions, a dynamic committee is selected to generate special blocks when faulty blocks are detected. We implemented and evaluated Ladder in a distributed network environment against several state-of-the-art methods. Our results show that Ladder achieves a 59.6% increase in throughput and a 20.9% reduction in latency.


https://www.usenix.org/conference/nsdi25/presentation/hu
Tuesday April 29, 2025 9:20am - 9:40am EDT
Independence Ballroom

9:40am EDT

Efficient Direct-Connect Topologies for Collective Communications
Tuesday April 29, 2025 9:40am - 10:00am EDT
Liangyu Zhao, University of Washington; Siddharth Pal, Raytheon BBN Technologies; Tapan Chugh, University of Washington; Weiyang Wang, MIT CSAIL; Jason Fantl, Prithwish Basu, and Joud Khoury, Raytheon BBN Technologies; Arvind Krishnamurthy, University of Washington


We consider the problem of distilling efficient network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency vs. bandwidth trade-off associated with the workload. Our approach synthesizes many different topologies and communication schedules for a given cluster size and degree, then identifies the best option for a given workload. Our algorithms start from small, optimal base topologies and associated schedules, using techniques that can be iteratively applied to derive much larger topologies and schedules. Additionally, we incorporate well-studied large-scale graph topologies into our algorithmic framework by producing efficient communication schedules for them using a novel polynomial-time algorithm. Our evaluation uses multiple testbeds and large-scale simulations to demonstrate significant performance benefits from our derived topologies and schedules.


https://www.usenix.org/conference/nsdi25/presentation/zhao-liangyu
Tuesday April 29, 2025 9:40am - 10:00am EDT
Liberty Ballroom

9:40am EDT

Vegeta: Enabling Parallel Smart Contract Execution in Leaderless Blockchains
Tuesday April 29, 2025 9:40am - 10:00am EDT
Tianjing Xu and Yongqi Zhong, Shanghai Jiao Tong University; Yiming Zhang, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3; Ruofan Xiong, Xiamen University; Jingjing Zhang, Fudan University; Guangtao Xue and Shengyun Liu, Shanghai Jiao Tong University and Shanghai Key Laboratory of Trusted Data Circulation, Governance and Web3


Consensus and smart contract execution play complementary roles in blockchain systems. Leaderless consensus, as a promising direction in the blockchain context, can better utilize the resources of each node and/or avoid incurring the extra burden of timing assumptions. As modern Byzantine-Fault Tolerant (BFT) consensus protocols can order several hundred thousand transactions per second, contract execution is becoming the performance bottleneck. Adding concurrency to contract execution is a natural way to boost its performance, but none of the existing frameworks is a perfect fit for leaderless consensus.

We propose speculate-order-replay, a generic framework tailored to leaderless consensus protocols. Our framework allows each proposer to (pre-)process transactions prior to consensus, better utilizing its computing resources. We instantiate the framework with a concrete concurrency control protocol Vegeta. Vegeta speculatively executes a series of transactions and analyzes their dependencies before consensus, and later deterministically replays the schedule. We ran experiments under the real-world Ethereum workload on 16vCPU virtual machines. Our evaluation results show that Vegeta achieved up to 7.8× speedup compared to serial execution. When deployed on top of a leaderless consensus protocol with 10 nodes, Vegeta still achieved 6.9× speedup.


https://www.usenix.org/conference/nsdi25/presentation/xu-tianjing
Tuesday April 29, 2025 9:40am - 10:00am EDT
Independence Ballroom

10:00am EDT

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
Tuesday April 29, 2025 10:00am - 10:20am EDT
Alind Khare and Dhruv Garg, Georgia Institute of Technology; Sukrit Kalra, UC Berkeley; Snigdha Grandhi, Adobe; Ion Stoica, UC Berkeley; Alexey Tumanov, Georgia Institute of Technology


The increasing deployment of ML models on the critical path of production applications requires ML inference serving systems to serve these models under unpredictable and bursty request arrival rates. Serving many models under such conditions requires a careful balance between each application’s latency and accuracy requirements and the overall efficiency of utilization of scarce resources. Faced with this tension, state-of-the-art systems either choose a single model representing a static point in the latency-accuracy tradeoff space to serve all requests or incur latency target violations by loading specific models on the critical path of request serving. Our work instead resolves this tension through a resource-efficient serving of the entire range of models spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct, achieves this by carefully inserting specialized control-flow operators in pre-trained, weight-shared super-networks. These operators enable SubNetAct to dynamically route a request through the network to actuate a specific model that meets the request’s latency and accuracy target. Thus, SubNetAct can serve a vastly higher number of models than prior systems while requiring upto 2.6× lower memory. More crucially, SubNetAct’s near-instantaneous actuation of a wide-range of models unlocks the design space of fine-grained, reactive scheduling policies. We design one such extremely effective policy, SlackFit, and instantiate both SubNetAct and SlackFit in a real system, SuperServe. On real-world traces derived from a Microsoft workload, SuperServe achieves 4.67% higher accuracy for the same latency targets and 2.85× higher latency target attainment for the same accuracy.


https://www.usenix.org/conference/nsdi25/presentation/khare
Tuesday April 29, 2025 10:00am - 10:20am EDT
Liberty Ballroom

10:00am EDT

Shoal++: High Throughput DAG BFT Can Be Fast and Robust!
Tuesday April 29, 2025 10:00am - 10:20am EDT
Balaji Arun and Zekun Li, Aptos Labs; Florian Suri-Payer, Cornell University; Sourav Das, UIUC; Alexander Spiegelman, Aptos Labs


Today's practical partially synchronous Byzantine Fault Tolerant consensus protocols trade off low latency and high throughput. On the one end, traditional BFT protocols such as PBFT and its derivatives optimize for latency. They require, in fault-free executions, only 3 message delays to commit, the optimum for BFT consensus. However, this class of protocols typically relies on a single leader, hampering throughput scalability. On the other end, a new class of so-called DAG-BFT protocols demonstrates how to achieve highly scalable throughput by separating data dissemination from consensus, and using every replica as proposer. Unfortunately, existing DAG-BFT protocols pay a steep latency premium, requiring on average 10.5 message delays to commit transactions.

This work aims to soften this tension, and proposes Shoal++, a novel DAG-based BFT consensus system that offers the throughput of DAGs while reducing end-to-end consensus commit latency to an average of 4.5 message delays. Our empirical findings are encouraging, showing that Shoal++ achieves throughput comparable to state-of-the-art DAG BFT solutions while reducing latency by up to 60%, even under less favorable network and failure conditions.


https://www.usenix.org/conference/nsdi25/presentation/arun
Tuesday April 29, 2025 10:00am - 10:20am EDT
Independence Ballroom

10:20am EDT

Coffee and Tea Break
Tuesday April 29, 2025 10:20am - 10:50am EDT
Tuesday April 29, 2025 10:20am - 10:50am EDT
Liberty Ballroom Foyer

10:50am EDT

Learning Production-Optimized Congestion Control Selection for Alibaba Cloud CDN
Tuesday April 29, 2025 10:50am - 11:10am EDT
Xuan Zeng, Alibaba Cloud; Haoran Xu, Sun Yat-sen University; Chen Chen and Xumiao Zhang, Alibaba Cloud; Xiaoxi Zhang and Xu Chen, Sun Yat-sen University; Guihai Chen, Nanjing University; Yubing Qiu, Yiping Zhang, Chong Hao, and Ennan Zhai, Alibaba Cloud


Today's content delivery networks (CDNs) typically use static congestion control (CC) configurations, yet the diverse network environments preclude a universally optimal CC for all geographical regions, as evidenced by our extensive measurements. Current CC algorithms, limited by narrow applicability or high maintenance costs, struggle in large-scale CDNs. This work introduces AliCCS, the first CC Selection (CCS) approach tailored for production CDN, integrating fine-grained domain knowledge for learning to choose the best CC from existing, well-established ones. Through an over-one-year real-world deployment in Alibaba Cloud CDN, AliCCS has enhanced the Quality-of-Experience (QoE) by up to 9.31%, surpassing the competitive margin in the CDN market, and significantly reduced the retransmission rate by 25.51% to 174.36% across all provinces of China, leading to cost savings over 10 million US dollars. We also share key insights and experiences from deploying AliCCS at scale, highlighting traffic patterns in Alibaba Cloud CDN.


https://www.usenix.org/conference/nsdi25/presentation/zeng
Tuesday April 29, 2025 10:50am - 11:10am EDT
Liberty Ballroom

10:50am EDT

HA/TCP: A Reliable and Scalable Framework for TCP Network Functions
Tuesday April 29, 2025 10:50am - 11:10am EDT
Haoyu Gu, Ali José Mashtizadeh, and Bernard Wong, University of Waterloo


Layer 7 network functions (NFs) are a critical piece of modern network infrastructure. As a result, the scalability and reliability of these NFs are important but challenging because of the complexity of layer 7 NFs. This paper presents HA/TCP, a framework that enables migration and failover of layer 7 NFs. HA/TCP uses a novel replication mechanism to synchronize the state between replicas with low overhead, enabling seamless migration and failover of TCP connections. HA/TCP encapsulates the implementation details into our replicated socket interface to allow developers to easily add high availability to their layer 7 NFs such as WAN accelerators, load balancers, and proxies. Our benchmarks show that HA/TCP provides reliability for a 100 Gbps NF with as little as 0.2% decrease in client throughput. HA/TCP transparently migrates a connection between replicas in 38 µs, including the network latency. We provide reliability to a SOCKS proxy and a WAN accelerator with less than 2% decrease in throughput and a modest increase in CPU usage.


https://www.usenix.org/conference/nsdi25/presentation/gu
Tuesday April 29, 2025 10:50am - 11:10am EDT
Independence Ballroom

11:10am EDT

GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale
Tuesday April 29, 2025 11:10am - 11:30am EDT
Lingyun Yang, Hong Kong University of Science and Technology; Yongchen Wang and Yinghao Yu, Alibaba Group; Qizhen Weng, Hong Kong University of Science and Technology; Jianbo Dong, Kan Liu, Chi Zhang, Yanyi Zi, Hao Li, Zechao Zhang, Nan Wang, Yu Dong, Menglei Zheng, Lanlan Xi, Xiaowei Lu, Liang Ye, Guodong Yang, Binzhang Fu, Tao Lan, Liping Zhang, and Lin Qu, Alibaba Group; Wei Wang, Hong Kong University of Science and Technology


Online recommender systems use deep learning recommendation models (DLRMs) to provide accurate, personalized recommendations to improve customer experience. However, efficiently provisioning DLRM services at scale is challenging. DLRMs exhibit distinct resource usage patterns: they require a large number of CPU cores and a tremendous amount of memory, but only a small number of GPUs. Running them in multi-GPU servers quickly exhausts the servers' CPU and memory resources, leaving a large number of unallocated GPUs stranded, unable to utilize by other tasks.

This paper describes Prism, a production DLRM serving system that eliminates GPU fragmentation by means of resource disaggregation. In Prism, a fleet of CPU nodes (CNs) interconnect with a cluster of heterogeneous GPU nodes (HNs) through RDMA, leading to two disaggregated resource pools that can independently scale. Prism automatically divides DLRMs into CPU- and GPU-intensive subgraphs and schedules them on CNs and HNs for disaggregated serving. Prism employs various techniques to minimize the latency overhead caused by disaggregation, including optimal graph partitioning, topology-aware resource management, and SLO-aware communication scheduling. Evaluations show that Prism effectively reduces CPU and GPU fragmentation by 53% and 27% in a crowded GPU cluster. During seasonal promotion events, it efficiently enables capacity loaning from training clusters, saving over 90% of GPUs. Prism has been deployed in production clusters for over two years and now runs on over 10k GPUs.


https://www.usenix.org/conference/nsdi25/presentation/yang
Tuesday April 29, 2025 11:10am - 11:30am EDT
Liberty Ballroom

11:10am EDT

High-level Programming for Application Networks
Tuesday April 29, 2025 11:10am - 11:30am EDT
Xiangfeng Zhu, Yuyao Wang, Banruo Liu, Yongtong Wu, and Nikola Bojanic, University of Washington; Jingrong Chen, Duke University; Gilbert Louis Bernstein and Arvind Krishnamurthy, University of Washington; Sam Kumar, University of Washington and UCLA; Ratul Mahajan, University of Washington; Danyang Zhuo, Duke University


Application networks facilitate communication between the microservices of cloud applications. They are built today using service meshes with low-level specifications that make it difficult to express application-specific functionality (e.g., access control based on RPC fields), and they can more than double the RPC latency. We develop AppNet, a framework that makes it easy to build expressive and high-performance application networks. Developers specify rich RPC processing in a high-level language with generalized match-action rules and built-in state management. We compile the specifications to high-performance code after optimizing where (e.g., client, server) and how (e.g., RPC library, proxy) each RPC processing element runs. The optimization uses symbolic abstraction and execution to judge if different runtime configurations of possibly-stateful RPC processing elements are semantically equivalent for arbitrary RPC streams. Our experiments show that AppNet can express common application network function in only 7-28 lines of code. Its optimizations lower RPC processing latency by up to 82%.


https://www.usenix.org/conference/nsdi25/presentation/zhu
Tuesday April 29, 2025 11:10am - 11:30am EDT
Independence Ballroom

11:30am EDT

Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
Tuesday April 29, 2025 11:30am - 11:50am EDT
Jianbo Dong, Kun Qian, Pengcheng Zhang, Zhilong Zheng, Liang Chen, Fei Feng, Yichi Xu, Yikai Zhu, Gang Lu, Xue Li, Zhihui Ren, Zhicheng Wang, Bin Luo, Peng Zhang, Yang Liu, Yanqing Chen, Yu Guan, Weicheng Wang, Chaojie Yang, Yang Zhang, Man Yuan, Hanyu Zhao, Yong Li, Zihan Zhao, Shan Li, Xianlong Zeng, Zhiping Yao, Binzhang Fu, Ennan Zhai, Wei Lin, Chao Wang, and Dennis Cai, Alibaba Cloud


Despite the success of diagnosis systems in traditional cloud computing, these systems are not suitable for pinpointing faults in AI model training cloud scenarios due to the differences in computing paradigms between traditional cloud computing and model training. As one of the largest cloud providers, we present Aegis, a fault diagnosis system specifically designed for AI model training service. We share our experience in the motivation, design, and evolution of Aegis. Keeping easy-to-deploy as the primary principle, Aegis Phase- 1 started by enhancing existing general-purpose diagnosis systems. After several months of evolution, Aegis Phase-2 cogitatively chose to customize the collective communication library for sophisticated failure localization in runtime without modifying customer code. Besides the failure localization, we further equipped Aegis with the capabilities on handling performance degradation and failure checking before delivery. Aegis has been deployed in our production training cloud service for one year. Aegis decreases more than 97% of the idle time wasted by diagnosis, 84% of the training task restart count and 71% of the performance degradation.


https://www.usenix.org/conference/nsdi25/presentation/dong
Tuesday April 29, 2025 11:30am - 11:50am EDT
Liberty Ballroom

11:30am EDT

State-Compute Replication: Parallelizing High-Speed Stateful Packet Processing
Tuesday April 29, 2025 11:30am - 11:50am EDT
Qiongwen Xu, Rutgers University; Sebastiano Miano, Politecnico di Milano; Xiangyu Gao and Tao Wang, New York University; Adithya Murugadass and Songyuan Zhang, Rutgers University; Anirudh Sivaraman, New York University; Gianni Antichi, Queen Mary University of London and Politecnico di Milano; Srinivas Narayana, Rutgers University


With the slowdown of Moore’s law, CPU-oriented packet processing in software will be significantly outpaced by emerging line speeds of network interface cards (NICs). Single-core packet-processing throughput has saturated.

We consider the problem of high-speed packet processing with multiple CPU cores. The key challenge is state—memory that multiple packets must read and update. The prevailing method to scale throughput with multiple cores involves state sharding, processing all packets that update the same state, e.g., flow, at the same core. However, given the skewed nature of realistic flow size distributions, this method is untenable, since total throughput is limited by single-core performance.

This paper introduces state-compute replication, a principle to scale the throughput of a single stateful flow across multiple cores using replication. Our design leverages a packet history sequencer running on a NIC or top-of-the-rack switch to enable multiple cores to update state without explicit synchronization. Our experiments with realistic data center and wide-area Internet traces show that state-compute replication can scale total packet-processing throughput linearly with cores, independent of flow size distributions, across a range of realistic packet-processing programs.


https://www.usenix.org/conference/nsdi25/presentation/xu-qiongwen
Tuesday April 29, 2025 11:30am - 11:50am EDT
Independence Ballroom

11:50am EDT

PAPAYA Federated Analytics Stack: Engineering Privacy, Scalability and Practicality
Tuesday April 29, 2025 11:50am - 12:10pm EDT
Harish Srinivas, Graham Cormode, Mehrdad Honarkhah, Samuel Lurye, Jonathan Hehir, Lunwen He, George Hong, Ahmed Magdy, Dzmitry Huba, Kaikai Wang, Shen Guo, and Shoubhik Bhattacharya, Meta


Cross-device Federated Analytics (FA) is a distributed computation paradigm designed to answer analytics queries about and derive insights from data held locally on users’ devices. On-device computations combined with other privacy and security measures ensure that only minimal data is transmitted off-device, achieving a high standard of data protection. Despite FA’s broad adoption, the applicability of existing FA systems is limited by compromised accuracy; lack of flexibility for data analytics; and an inability to scale effectively. In this paper, we describe our approach to combine privacy, scalability, and practicality to build a system that overcomes these limitations. The PAPAYA system at Meta system leverages trusted execution environments (TEEs) and optimizes the use of on-device computing resources to facilitate federated data processing across large fleets of devices, while ensuring robust, defensible, and verifiable privacy safeguards. We focus on federated analytics (statistics and monitoring), in contrast to systems for federated learning (ML workloads), and we flag the key differences.


https://www.usenix.org/conference/nsdi25/presentation/srinivas
Tuesday April 29, 2025 11:50am - 12:10pm EDT
Liberty Ballroom

11:50am EDT

MTP: Transport for In-Network Computing
Tuesday April 29, 2025 11:50am - 12:10pm EDT
Tao Ji, UT Austin; Rohan Vardekar and Balajee Vamanan, University of Illinois Chicago; Brent E. Stephens, Google and University of Utah; Aditya Akella, UT Austin


In-network computing (INC) is being increasingly adopted to accelerate applications by offloading part of the applications’ computation to network devices. Such application-specific (L7) offloads have several attributes that the transport protocol must work with — they may mutate, intercept, reorder and
delay application messages that span multiple packets. At the same time the transport must also work with the buffering and computation constraints of network devices hosting the L7 offloads. Existing transports and alternative approaches fall short in these regards. Therefore, we present MTP, the first transport to natively support INC. MTP is built around two major components: 1) a novel message-oriented reliability protocol and 2) a resource-specific congestion control framework. We implement a full-fledged prototype of MTP based on DPDK. We show the efficacy of MTP in a testbed with a real INC application as well as with comprehensive microbenchmarks and large-scale simulations.


https://www.usenix.org/conference/nsdi25/presentation/ji
Tuesday April 29, 2025 11:50am - 12:10pm EDT
Independence Ballroom

12:10pm EDT

Lunch (on your own)
Tuesday April 29, 2025 12:10pm - 2:00pm EDT
Tuesday April 29, 2025 12:10pm - 2:00pm EDT
-

2:00pm EDT

ONCache: A Cache-Based Low-Overhead Container Overlay Network
Tuesday April 29, 2025 2:00pm - 2:20pm EDT
Shengkai Lin, Shizhen Zhao, Peirui Cao, and Xinchi Han, Shanghai Jiao Tong University; Quan Tian, Wenfeng Liu, Qi Wu, and Donghai Han, Broadcom; Xinbing Wang, Shanghai Jiao Tong University
Recent years have witnessed a widespread adoption of containers. While containers simplify and accelerate application development, existing container network technologies either incur significant overhead, which hurts performance for distributed applications, or lose flexibility or compatibility, which hinders the widespread deployment in production.
We carefully analyze the kernel data path of an overlay network, quantifying the time consumed by each segment of the data path and identifying the extra overhead in an overlay network compared to bare metal. We observe that this extra overhead generates repetitive results among packets, which inspires us to introduce caches within an overlay network.
We design and implement ONCache (Overlay Network Cache), a cache-based container overlay network, to eliminate the extra overhead while maintaining flexibility and compatibility. We implement ONCache using the extended Berkeley Packet Filter (eBPF) with only 524 lines of code, and integrate it as a plugin of Antrea. With ONCache, containers attain networking performance akin to that of bare metal. Compared to the standard overlay networks, ONCache improves throughput and request-response transaction rate by 12% and 36% for TCP (20% and 34% for UDP), respectively, while significantly reducing per-packet CPU overhead. Popular distributed applications also benefit from ONCache.
https://www.usenix.org/conference/nsdi25/presentation/lin-shengkai
Tuesday April 29, 2025 2:00pm - 2:20pm EDT
Liberty Ballroom

2:00pm EDT

Mitigating Scalability Walls of RDMA-based Container Networks
Tuesday April 29, 2025 2:00pm - 2:20pm EDT
Wei Liu, Tsinghua University and Alibaba Cloud; Kun Qian, Alibaba Cloud; Zhenhua Li, Tsinghua University; Feng Qian, University of Southern California; Tianyin Xu, UIUC; Yunhao Liu, Tsinghua University; Yu Guan, Shuhong Zhu, Hongfei Xu, Lanlan Xi, Chao Qin, and Ennan Zhai, Alibaba Cloud


As a state-of-the-art technique, RDMA-offloaded container networks (RCNs) can provide high-performance data communications among containers. Nevertheless, this seems to be subject to the RCN scale—when there are millions of containers simultaneously running in a data center, the performance decreases sharply and unexpectedly. In particular, we observe that most performance issues are related to RDMA NICs (RNICs), whose design and implementation defects might constitute the "scalability wall" of the RCN. To validate the conjecture, however, we are challenged by the limited visibility into the internals of today's RNICs. To address the dilemma, a more pragmatic approach is to infer the most likely causes of the performance issues according to the common abstractions of an RNIC's components and functionalities.

Specifically, we conduct combinatorial causal testing to efficiently reason about an RNIC's architecture model, effectively approximate its performance model, and thereby proactively optimize the NF (network function) offloading schedule. We embody the design into a practical system dubbed ScalaCN. Evaluation on production workloads shows that the end-to-end network bandwidth increases by 1.4× and the packet forwarding latency decreases by 31%, after resolving 82% of the causes inferred by ScalaCN. We report the performance issues of RNICs and the most likely causes to relevant vendors, all of which have been encouragingly confirmed; we are now closely working with the vendors to fix them.


https://www.usenix.org/conference/nsdi25/presentation/liu-wei
Tuesday April 29, 2025 2:00pm - 2:20pm EDT
Independence Ballroom

2:20pm EDT

GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters
Tuesday April 29, 2025 2:20pm - 2:40pm EDT
Kaiqiang Xu and Decang Sun, iSING Lab, Hong Kong University of Science and Technology; Han Tian, USTC; Junxue Zhang and Kai Chen, iSING Lab, Hong Kong University of Science and Technology


This paper explores the problem of scheduling machine Learning (ML) jobs while also taking into account the reduction of carbon emissions in the cluster. Traditional cluster schedulers for ML jobs mainly focus on optimizing job completion time~(JCT), but do not consider the environmental impact of their decisions, resulting in a suboptimal carbon footprint. To address this issue, we propose GREEN, an ML cluster scheduler that is both time-efficient and carbon-efficient. At its core, GREEN uses a unique carbon-aware scheduling algorithm that reduces carbon footprint with minimized impact on JCT.
Additionally, it leverages the temporal flexibility of ML jobs to reduce carbon emissions by shifting workloads to less carbon-intensive times, while still maintaining overall daily capacity. Our experiments using real ML jobs workload demonstrate that GREEN can achieve up to 41.2% reduction in cluster-wide carbon footprint and 12% reduction in peak power consumption, while incurring 3.6%-5.9% time efficiency tradeoff compared to existing methods.


https://www.usenix.org/conference/nsdi25/presentation/xu-kaiqiang
Tuesday April 29, 2025 2:20pm - 2:40pm EDT
Liberty Ballroom

2:20pm EDT

Eden: Developer-Friendly Application-Integrated Far Memory
Tuesday April 29, 2025 2:20pm - 2:40pm EDT
Anil Yelam, Stewart Grant, and Saarth Deshpande, UC San Diego; Nadav Amit, Technion, Israel Institute of Technology; Radhika Niranjan Mysore, VMware Research Group; Amy Ousterhout, UC San Diego; Marcos K. Aguilera, VMware Research Group; Alex C. Snoeren, UC San Diego


Far memory systems are a promising way to address the resource stranding problem in datacenters. Far memory systems broadly fall into two categories. On one hand, paging-based systems use hardware guards at the granularity of pages to intercept remote accesses, which require no application changes but incur significant faulting overhead. On the other hand, app-integrated systems use software guards on data objects and apply application-specific optimizations to avoid faulting overheads, but these systems require significant application redesign and/or suffer from overhead on local accesses. We propose Eden, a new approach to far memory that combines hardware guards with a small number of software guards in the form of programmer annotations, to achieve performance similar to app-integrated systems with minimal developer effort. Eden is based on the insight that applications generate most of their page faults at a small number of code locations, and those locations are easy to find programmatically. By adding hints to such locations, Eden can notify the pager about upcoming memory accesses to customize read-ahead and memory reclamation. We show that Eden achieves 19.4–178% higher performance than Fastswap for memory-intensive applications including DataFrame and memcached. Eden achieves performance comparable to AIFM with almost 100× fewer code changes.


https://www.usenix.org/conference/nsdi25/presentation/yelam
Tuesday April 29, 2025 2:20pm - 2:40pm EDT
Independence Ballroom

2:40pm EDT

The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling
Tuesday April 29, 2025 2:40pm - 3:00pm EDT
Linsong Guo, Danial Zuberi, Tal Garfinkel, and Amy Ousterhout, UC San Diego


Preemptive scheduling promises to mitigate head-of-line blocking and enable flexible scheduling while retaining a simple programming model. Despite this, preemption is underutilized in server-side software today. Instead, high-performance datacenter systems and language runtimes often rely on cooperative concurrency, or else use preemption only at very coarse timescales, limiting its effectiveness. A key reason that preemption is underutilized today is that existing preemption mechanisms have high and unpredictable overheads.

Intel recently introduced support for user interrupts, a new feature that offers an opportunity to change this. By enabling interrupts to be sent and received entirely in user space, user interrupts can significantly lower the overhead of preemption. In this paper, we shed light on how user interrupts impact the landscape of preemption mechanisms. We build two user-level schedulers that leverage user interrupts for low-overhead preemption. We find that user interrupts are not a panacea. For example, they provide limited benefits when other software layers constrain the kinds of scheduling policies that can be used. Still, user interrupts can match or exceed the performance of existing mechanisms for all but the highest preemption rates, while achieving much more consistent overheads and retaining a user friendly programming model.


https://www.usenix.org/conference/nsdi25/presentation/guo
Tuesday April 29, 2025 2:40pm - 3:00pm EDT
Liberty Ballroom

2:40pm EDT

Achieving Wire-Latency Storage Systems by Exploiting Hardware ACKs
Tuesday April 29, 2025 2:40pm - 3:00pm EDT
Qing Wang, Jiwu Shu, Jing Wang, and Yuhao Zhang, Tsinghua University


We present Juneberry, a low-latency communication framework for storage systems. Different from existing RPC frameworks, Juneberry provides a fast path for storage requests: they can be committed with a single round trip and server CPU bypass, thus delivering extremely low latency; the execution of these requests is performed asynchronously on the server CPU. Juneberry achieves it by relying on our proposed Ordered Queue abstraction, which exploits NICs’ hardware ACKs as commit signals of requests while ensuring linearizability of the whole system. Juneberry also supports durability by placing requests in persistent memory (PM). We implement Juneberry using commodity RDMA NICs and integrate it into two storage systems: Memcached (a widely used in-memory caching system) and PMemKV (a PM-based persistent key-value store). Evaluation shows that compared with RPC, Juneberry can significantly lower their latency under write-intensive workloads.


https://www.usenix.org/conference/nsdi25/presentation/wang-qing
Tuesday April 29, 2025 2:40pm - 3:00pm EDT
Independence Ballroom

3:00pm EDT

Securing Public Cloud Networks with Efficient Role-based Micro-Segmentation
Tuesday April 29, 2025 3:00pm - 3:20pm EDT
Sathiya Kumaran Mani and Kevin Hsieh, Microsoft; Santiago Segarra, Rice University; Ranveer Chandra, Microsoft; Yajie Zhou, University of Maryland; Srikanth Kandula, Microsoft


Securing network traffic within data centers is a critical and daunting challenge due to the increasing complexity and scale of modern public clouds. Micro-segmentation offers a promising solution by implementing fine-grained, workload-specific network security policies to mitigate potential attacks. However, the dynamic nature and large scale of deployments present significant obstacles in crafting precise security policies, limiting the practicality of this approach. To address these challenges, we introduce a novel system that efficiently processes vast volumes of network-flow logs and effectively infers the roles of network endpoints. Our method integrates domain knowledge and communication patterns in a principled manner, facilitating the creation of micro-segmentation policies at a large scale. Evaluations with real-world deployment demonstrate that our solution significantly surpasses existing algorithms in role inference accuracy. We implement our solution as an end-to-end system, and we show that our system is up to 21.5× more cost-efficient than Apache Flink, a widely used open-source stream processing system.


https://www.usenix.org/conference/nsdi25/presentation/mani
Tuesday April 29, 2025 3:00pm - 3:20pm EDT
Liberty Ballroom

3:00pm EDT

ODRP: On-Demand Remote Paging with Programmable RDMA
Tuesday April 29, 2025 3:00pm - 3:20pm EDT
Zixuan Wang, Xingda Wei, Jinyu Gu, Hongrui Xie, Rong Chen, and Haibo Chen, Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University


Memory disaggregation with OS swapping is becoming popular for next-generation datacenters. RDMA is a promising technique for achieving this. However, RDMA does not support dynamic memory management in the data path. Current systems rely on RDMA’s control path operations, which are designed for coarse-grained memory management. This results in a trade-off between performance and memory utilization and also requires significant CPU usage, which is a limited resource on memory nodes.

This paper introduces On-Demand Remote Paging, the first system that smartly chains native RDMA data path primitives to offload all memory access and management operations onto the RDMA-capable NIC (RNIC). However, efficiently implementing these operations is challenging due to the limited capability of RNIC. ODRP leverages the semantics of OS swapping and adopts a client-assisted principle to address the efficiency and functionality challenges. Compared to the state-of-the-art system, ODRP can achieve significantly better memory utilization, no CPU usage while introducing only a 0.8% to 14.6% performance overhead in real-world applications.


https://www.usenix.org/conference/nsdi25/presentation/wang-zixuan
Tuesday April 29, 2025 3:00pm - 3:20pm EDT
Independence Ballroom

3:20pm EDT

Coffee and Tea Break
Tuesday April 29, 2025 3:20pm - 3:50pm EDT
Tuesday April 29, 2025 3:20pm - 3:50pm EDT
Liberty Ballroom Foyer

3:50pm EDT

Understanding and Profiling NVMe-over-TCP Using ntprof
Tuesday April 29, 2025 3:50pm - 4:10pm EDT
Yuyuan Kang and Ming Liu, University of Wisconsin-Madison


NVMe-over-TCP (NVMe/TCP) is an emerging remote storage protocol, increasingly adopted in enterprises and clouds. It establishes a high-performance reliable data channel between clients and storage targets to deliver block I/Os. Understanding and analyzing the protocol execution details and how well storage workloads run atop are pivotal for system developers and infrastructure engineers. However, our community lacks such a profiling utility, whereas existing solutions are ad-hoc, tedious, and heuristic-driven. Realizing it is challenging due to the unpredictable I/O workload profile, intricate system layer interaction, and deep execution pipeline.

This paper presents ntprof, a systematic, informative, and lightweight NVMe/TCP profiler. Our key idea is to view the NVMe/TCP storage substrate as a lossless switched network and apply network monitoring techniques. We model each on-path system module as a software switch, equip it with a programmable profiling agent on the data plane, and develop a proactive query interface for statistics collection and analysis. ntprof, comprising a kernel module and a user-space utility, allows developers to define various profiling tasks, incurs marginal overhead when co-locating with applications, and generates performance reports based on prescribed specifications. We build ntprof atop Linux kernel 5.15.143 and apply it in six cases, i.e., end-to-end latency breakdown, interference analysis, SW/HW bottleneck localization, and application performance diagnostic. ntprof is available at https://github.com/netlab-wisconsin/ntprof.


https://www.usenix.org/conference/nsdi25/presentation/kang
Tuesday April 29, 2025 3:50pm - 4:10pm EDT
Liberty Ballroom

3:50pm EDT

CellReplay: Towards accurate record-and-replay for cellular networks
Tuesday April 29, 2025 3:50pm - 4:10pm EDT
William Sentosa, University of Illinois Urbana-Champaign; Balakrishnan Chandrasekaran, VU Amsterdam; P. Brighten Godfrey, University of Illinois Urbana-Champaign and Broadcom; Haitham Hassanieh, EPFL


The inherent variability of real-world cellular networks makes it hard to evaluate, reproduce, and debug the performance of networked applications running on these networks. A common approach is to record and replay a trace of observed cellular network performance. However, we show that the state-of-the-art record-and-replay technique produces empirically inaccurate results that can cause evaluation bias. This paper presents the design and implementation of CellReplay, a tool that records the time-varying performance of a live cellular network into traces using preset workloads and faithfully replays the observed performance for other workloads through an emulated network interface. The key challenge in achieving high accuracy is to replay varying network behavior in a way that captures its sensitivity to the workload. CellReplay records network behavior under two predefined workloads simultaneously and interpolates upon replay for other workloads. Across various challenging network conditions, our evaluation shows that real-world networked applications (e.g., web browsing or video streaming) running on CellReplay achieve similar performance (e.g., page load time or bitrate selection) to their live network counterparts, with significantly reduced error compared to the prior method.


https://www.usenix.org/conference/nsdi25/presentation/sentosa
Tuesday April 29, 2025 3:50pm - 4:10pm EDT
Independence Ballroom

4:10pm EDT

Building an Elastic Block Storage over EBOFs Using Shadow Views
Tuesday April 29, 2025 4:10pm - 4:30pm EDT
Sheng Jiang, Carnegie Mellon University; Ming Liu, University of Wisconsin-Madison


The EBOF (Ethernet-Bunch-Of-Flash) has emerged as an enticing and promising disaggregated storage platform due to its streamlined I/O processing, high scalability, and substantial energy/cost-efficiency improvement. An EBOF applies a smart-sender dumb-receiver design philosophy and provides backward-compatible storage volumes to expedite system deployment. Yet, the static and opaque internal I/O processing pipeline lacks resource allocation, I/O scheduling, and traffic orchestration capabilities, entailing bandwidth waste, workload non-adaptiveness, and performance interference.

This paper presents the design and implementation of a distributed telemetry system (called shadow view) to tackle the above challenges and facilitate the effective use of an EBOF. We model an EBOF as a two-layer multi-switch architecture and develop a view development protocol to construct the EBOF running snapshot and expose internal execution statistics at runtime. Our design is motivated by the observation that fast data center networks make the overheads of inter-server communication and synchronization negligible. We demonstrate the effectiveness of shadow view by building a block storage (dubbed Flint) atop EBOFs. The enhanced I/O data plane allows us to develop new three techniques–an elastic volume manager, a view-enabled bandwidth auction mechanism, and an eIO scheduler. Our evaluations using the Fungible FS1600 EBOF show that a Flint volume achieves 9.3/9.2 GB/s read/write bandwidth with no latency degradation, significantly outperforming the defacto EBOF volume. It achieves up to 2.9× throughput improvements when running an object store. Flint is tenant-aware and remote target-aware, delivering efficient multi-tenancy and workload adaptiveness.


https://www.usenix.org/conference/nsdi25/presentation/jiang
Tuesday April 29, 2025 4:10pm - 4:30pm EDT
Liberty Ballroom

4:10pm EDT

Large Network UWB Localization: Algorithms and Implementation
Tuesday April 29, 2025 4:10pm - 4:30pm EDT
Nakul Garg and Irtaza Shahid, University of Maryland, College Park; Ramanujan K Sheshadri, Nokia Bell Labs; Karthikeyan Sundaresan, Georgia Institute of Technology; Nirupam Roy, University of Maryland, College Park


Localization of networked nodes is an essential problem in emerging applications, including first-responder navigation, automated manufacturing lines, vehicular and drone navigation, asset tracking, Internet of Things, and 5G communication networks. In this paper, we present Locate3D, a novel system for peer-to-peer node localization and orientation estimation in large networks. Unlike traditional range-only methods, Locate3D introduces angle-of-arrival (AoA) data as an added network topology constraint. The system solves three key challenges: it uses angles to reduce the number of measurements required by 4× and jointly uses range and angle data for location estimation. We develop a spanning-tree approach for fast location updates, and to ensure the output graphs are rigid and uniquely realizable, even in occluded or weakly connected areas. Locate3D cuts down latency by up to 75% without compromising accuracy, surpassing standard range-only solutions. It has a 0.86 meter median localization error for building-scale multi-floor networks (32 nodes, 0 anchors) and 12.09 meters for large-scale networks (100,000 nodes, 15 anchors).


https://www.usenix.org/conference/nsdi25/presentation/garg
Tuesday April 29, 2025 4:10pm - 4:30pm EDT
Independence Ballroom

4:30pm EDT

Pushing the Limits of In-Network Caching for Key-Value Stores
Tuesday April 29, 2025 4:30pm - 4:50pm EDT
Gyuyeong Kim, Sungshin Women's University


We present OrbitCache, a new in-network caching architecture that can cache variable-length items to balance a wide range of key-value workloads. Unlike existing works, OrbitCache does not cache hot items in the switch memory. Instead, we make hot items revisit the switch data plane continuously by exploiting packet recirculation. Our approach keeps cached key-value pairs in the switch data plane while freeing them from item size limitations caused by hardware constraints. We implement an OrbitCache prototype on an Intel Tofino switch. Our experimental results show that OrbitCache can balance highly skewed workloads and is robust to various system conditions.


https://www.usenix.org/conference/nsdi25/presentation/kim
Tuesday April 29, 2025 4:30pm - 4:50pm EDT
Liberty Ballroom

4:30pm EDT

Towards Energy Efficient 5G vRAN Servers
Tuesday April 29, 2025 4:30pm - 4:50pm EDT
Anuj Kalia, Microsoft; Nikita Lazarev, MIT; Leyang Xue, The University of Edinburgh; Xenofon Foukas and Bozidar Radunovic, Microsoft; Francis Y. Yan, Microsoft Research and UIUC


We study the problem of improving energy efficiency in virtualized radio access network (vRAN) servers, focusing on CPUs. Two distinct characteristics of vRAN software—strict real-time sub-millisecond deadlines and its proprietary black-box nature—preclude the use of existing general-purpose CPU energy management techniques. This paper presents RENC, a system that saves energy by adjusting CPU frequency in response to sub-second variations in cellular workloads, using the following techniques. First, despite large fluctuations in vRAN CPU load at sub-ms timescales, RENC establishes safe low-load intervals, e.g., by coupling Media Access Control (MAC) layer rate limiting with CPU frequency changes. This prevents high traffic during low-power operation, which would otherwise cause deadline misses. Second, we design techniques to compute CPU frequencies that are safe for these low-load intervals, achieved by measuring the slack in vRAN threads' deadlines using Linux eBPF hooks, or minor binary rewriting of the vRAN software. Third, we demonstrate the need to handle CPU load spikes triggered by control operations, such as new users attaching to the network. Our evaluation in a state-of-the-art vRAN testbed shows that our techniques reduces a vRAN server's CPU power consumption by up to 45% (29% server-wide).


https://www.usenix.org/conference/nsdi25/presentation/kalia
Tuesday April 29, 2025 4:30pm - 4:50pm EDT
Independence Ballroom

4:50pm EDT

Building Massive MIMO Baseband Processing on a Single-Node Supercomputer
Tuesday April 29, 2025 4:50pm - 5:10pm EDT
Xincheng Xie, Wentao Hou, Zerui Guo, and Ming Liu, University of Wisconsin-Madison


The rising deployment of massive MIMO coupled with the wide adoption of virtualized radio access networks (vRAN) poses an unprecedented computational demand on the baseband processing, hardly met by existing vRAN hardware substrates. The single-node supercomputer, an emerging computing platform, offers scalable computation and communication capabilities, making it a promising target to hold and run the baseband pipeline. However, realizing this is non-trivial due to the mismatch between (a) the diverse execution granularities and incongruent parallel degrees of different stages along the software processing pipeline and (b) the underlying evolving irregular hardware parallelism at runtime.

This paper closes the gap by designing and implementing MegaStation–an application-platform co-designed system that effectively harnesses the computing power of a single-node supercomputer for processing massive MIMO baseband. Our key insight is that one can adjust the execution granularity and reconstruct the baseband processing pipeline on the fly based on the monitored hardware parallelism status. Inspired by dynamic instruction scheduling, MegaStation models the single-node supercomputer as a tightly coupled microprocessor and employs a scoreboarding-like algorithm to orchestrate "baseband processing" instructions over GPU-instantiated executors. Our evaluations using the GigaIO FabreX demonstrate that MegaStation achieves up to 66.2% lower tail frame processing latency and 4× higher throughput than state-of-the-art solutions. MegaStation is a scalable and adaptive solution that can meet today’s vRAN requirements.


https://www.usenix.org/conference/nsdi25/presentation/xie
Tuesday April 29, 2025 4:50pm - 5:10pm EDT
Independence Ballroom

5:10pm EDT

Efficient Multi-WAN Transport for 5G with OTTER
Tuesday April 29, 2025 5:10pm - 5:30pm EDT
Mary Hogan, Oberlin College; Gerry Wan, Google; Yiming Qiu, University of Michigan; Sharad Agarwal and Ryan Beckett, Microsoft; Rachee Singh, Cornell University; Paramvir Bahl, Microsoft


In the ongoing cloudification of 5G, software network functions (NFs) are replacing fixed-function network hardware, allowing 5G network operators to leverage the benefits of cloud computing. The migration of NFs and their management to the cloud causes 5G traffic to traverse an operator’s wide-area network (WAN) to the cloud WAN that hosts the datacenters (DCs) running 5G NFs and applications. However, achieving end-to-end (E2E) performance for 5G traffic across two WANs is hard. Placing 5G flows across two WANs with different performance and reliability characteristics, edge and DC resource constraints, and interference from other flows is different and more challenging than single-WAN traffic engineering. We address this challenge and show that orchestrating E2E paths across a multi-WAN overlay allows us to achieve average 13% more throughput, 15% less RTT, 45% less jitter, or reduce average loss from 0.06% to under 0.001%. We implement our multi-WAN 5G flow placement in a scalable optimization prototype that allocates 26%–45% more bytes on the network than greedy baselines while also satisfying the service demands of more flows.


https://www.usenix.org/conference/nsdi25/presentation/hogan
Tuesday April 29, 2025 5:10pm - 5:30pm EDT
Independence Ballroom

6:00pm EDT

NSDI '25 Poster Session and Reception
Tuesday April 29, 2025 6:00pm - 7:30pm EDT
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, authors, and symposium organizers.
Tuesday April 29, 2025 6:00pm - 7:30pm EDT
Franklin Hall A

7:30pm EDT

Meta Sponsor Event
Tuesday April 29, 2025 7:30pm - 8:30pm EDT
Join Meta to learn more about current projects and enjoy refreshments while networking with other attendees.
Sponsors
Tuesday April 29, 2025 7:30pm - 8:30pm EDT
Franklin Hall 5-6
 
  • Filter By Date
  • Filter By Venue
  • Filter By Type
  • Timezone

Share Modal

Share this link via

Or copy link

Filter sessions
Apply filters to sessions.
Filtered by Date -