NSDI '25: 22nd USENIX Symposium on Networked Systems Design and Implementation: Full Schedule

9:00am EDT

Verifying maximum link loads in a changing world

Wednesday April 30, 2025 9:00am - 9:20am EDT

Tibor Schneider, ETH Zürich; Stefano Vissicchio, University College London; Laurent Vanbever, ETH Zürich

To meet ever more stringent requirements, network operators often need to reason about worst-case link loads. Doing so involves analyzing traffic forwarding after failures and BGP route changes. State-of-the-art systems identify failure scenarios causing congestion, but they ignore route changes.

We present Velo, the first verification system that efficiently finds maximum link loads under failures and route changes. The key building block of Velo is its ability to massively reduce the gigantic space of possible route changes thanks to (i) a router-based abstraction for route changes, (ii) a theoretical characterization of scenarios leading to worst-case link loads, and (iii) an approximation of input traffic matrices. We fully implement and extensively evaluate Velo. Velo takes only a few minutes to accurately compute all worst-case link loads in large ISP networks. It thus provides operators with critical support to robustify network configurations, improve network management and take business decisions.

https://www.usenix.org/conference/nsdi25/presentation/schneider

Liberty Ballroom

Track 1

9:20am EDT

A Layered Formal Methods Approach to Answering Queue-related Queries

Wednesday April 30, 2025 9:20am - 9:40am EDT

Liberty Ballroom

Divya Raghunathan, Maria Apostolaki, and Aarti Gupta, Princeton University

Queue dynamics introduce significant uncertainty in network management tasks such as debugging, performance monitoring, and analysis. Despite numerous queue-monitoring techniques, many networks today continue to collect only per-port packet counts (e.g., using SNMP). Although queue lengths are correlated with packet counts, deriving the precise correlation between them is very challenging since packet counts do not specify many quantities (e.g., packet arrival order) which affect queue lengths.

This paper presents QuASI, a system that can answer many queue-related queries using only coarse-grained per-port packet counts. QuASI checks whether there exists a packet trace that is consistent with the packet counts and satisfies a query. To scale on large problem instances, QuASI relies on a layered approach and on a novel enqueue-rate abstraction, which is lossless for the class of queries that QuASI answers. The first layer employs a novel and efficient algorithm that generates a cover-set of abstract traces, constructs representative abstract traces from the cover-set, and efficiently checks each representative abstract trace by leveraging a known result on (0,1)-matrix existence. The first layer guarantees no false negatives: if the first layer says "No", there is no packet trace consistent with the observed packet counts that makes the query true. If it says "Yes", further verification is needed, which the second layer resolves using an SMT solver. As a result, QuASI has no false positives and no false negatives.

Our evaluations show that QuASI is up to 10⁶× faster than the state of the art, and can answer non-trivial queries about queue metrics (e.g., queue length) using minute-granularity packet counts. Our work is the first step toward more practical formal performance analysis under given measurements.
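The abstract does not name the (0,1)-matrix existence result that the first layer relies on. One classical result of this kind is the Gale-Ryser theorem, which decides whether a 0/1 matrix with given row and column sums exists. The sketch below implements that test only as an illustration; the function name is hypothetical, and it is an assumption that this theorem resembles the check QuASI performs.

```python
def binary_matrix_exists(row_sums, col_sums):
    """Gale-Ryser test: does a 0/1 matrix with these row/column sums exist?

    Illustrative only; not taken from the QuASI paper or its code.
    """
    rows = sorted(row_sums, reverse=True)  # theorem needs non-increasing rows
    cols = list(col_sums)
    if min(rows + cols, default=0) < 0:
        return False
    if sum(rows) != sum(cols):
        return False
    prefix = 0
    for k, r in enumerate(rows, start=1):
        prefix += r
        # Each column can place at most min(c, k) ones in the k largest rows.
        if prefix > sum(min(c, k) for c in cols):
            return False
    return True


if __name__ == "__main__":
    print(binary_matrix_exists([2, 1], [1, 1, 1]))
    print(binary_matrix_exists([2, 2], [3, 1]))
```

The appeal of such a test is that it answers an existence question from the marginal counts alone, without enumerating candidate matrices.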

https://www.usenix.org/conference/nsdi25/presentation/raghunathan

Track 1

9:40am EDT

Runtime Protocol Refinement Checking for Distributed Protocol Implementations

Wednesday April 30, 2025 9:40am - 10:00am EDT

Liberty Ballroom

Ding Ding, Zhanghan Wang, Jinyang Li, and Aurojit Panda, NYU

Despite significant progress in verifying protocols, services that implement distributed protocols (we refer to these as DPIs in what follows), e.g., Chubby or Etcd, can exhibit safety bugs in production deployments. These bugs are often introduced by programmers when converting protocol descriptions into code. This paper introduces Runtime Protocol Refinement Checking (RPRC), a runtime approach for detecting protocol implementation bugs in DPIs. RPRC systems observe a deployed DPI's runtime behavior and notify operators when this behavior evidences a protocol implementation bug, allowing operators to mitigate the bug's impact and developers to fix the bug. We have developed an algorithm for RPRC and implemented it in a system called Ellsberg that targets DPIs that assume fail-stop failures and the asynchronous (or partially synchronous) model. Our goal when designing Ellsberg was to make no assumptions about how DPIs are implemented, and to avoid additional coordination or communication. Therefore, Ellsberg builds on the observation that in the absence of Byzantine failures, a protocol's safety properties are maintained if all live DPI processes correctly implement the protocol. Thus, Ellsberg checks RPRC by comparing messages sent and received by each DPI process to those produced by a simulated execution of the protocol. We apply Ellsberg to three open source DPIs, Etcd, Zookeeper and Redis Raft, and show that we can detect previously reported protocol bugs in these DPIs.

https://www.usenix.org/conference/nsdi25/presentation/ding

Track 1

10:00am EDT

CEGS: Configuration Example Generalizing Synthesizer

Wednesday April 30, 2025 10:00am - 10:20am EDT

Liberty Ballroom

Jianmin Liu, Tsinghua University; Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Yukai Miao, Zhongguancun Laboratory

Network configuration synthesis promises to increase the efficiency of network management by reducing human involvement. However, despite significant advances in this field, existing synthesizers still require much human effort in drafting configuration templates or coding in a domain-specific language. We argue that the main reason for this is that a core capability is missing for current synthesizers: identifying and following configuration examples in configuration manuals and generalizing them to arbitrary topologies.

In this work, we fill this capability gap with two recent advancements in artificial intelligence: graph neural networks (GNNs) and large language models (LLMs). We build CEGS, which can automatically identify appropriate configuration examples, then follow and generalize them to fit target network scenarios. CEGS features a GNN-based Querier to identify relevant examples from device documentation, a GNN-based Classifier to generalize the example to arbitrary topologies, and an efficient LLM-driven synthesis method to quickly and correctly synthesize configurations that comply with the intents. Evaluations on real-world networks and complex intents show that CEGS can automatically synthesize correct configurations for a network of 1094 devices without human involvement. In contrast, the state-of-the-art LLM-based synthesizer is more than 30 times slower than CEGS on average, even when human experts are in the loop.

https://www.usenix.org/conference/nsdi25/presentation/liu-jianmin

Track 1

10:50am EDT

Suppressing BGP Zombies with Route Status Transparency

Wednesday April 30, 2025 10:50am - 11:10am EDT

Liberty Ballroom

Yosef Edery Anahory, The Hebrew University of Jerusalem; Jie Kong, Nicholas Scaglione, and Justin Furuness, University of Connecticut; Hemi Leibowitz, The College of Management Academic Studies; Amir Herzberg and Bing Wang, University of Connecticut; Yossi Gilad, The Hebrew University of Jerusalem

Withdrawal suppression has been a known weakness of BGP for over a decade. It has a significant detrimental impact on both the reliability and security of inter-domain routing on the Internet. This paper presents Route Status Transparency (RoST), the first design that efficiently and securely thwarts withdrawal suppression misconfigurations and attacks. RoST allows ASes to efficiently verify whether a route has been withdrawn; it is compatible with BGP as well as with BGP security enhancements. We use simulations on the Internet’s AS-level topology to evaluate the benefits from adopting RoST. We use an extensive real-world BGP announcements dataset to show that it is efficient in terms of storage, bandwidth, and computational requirements.

https://www.usenix.org/conference/nsdi25/presentation/anahory

Track 1

11:10am EDT

ValidaTor: Domain Validation over Tor

Wednesday April 30, 2025 11:10am - 11:30am EDT

Liberty Ballroom

Jens Frieß, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt; Haya Schulmann, National Research Center for Applied Cybersecurity ATHENE and Goethe-Universität Frankfurt; Michael Waidner, National Research Center for Applied Cybersecurity ATHENE and Technische Universität Darmstadt

Domain Validation (DV) is the primary method used by Certificate Authorities (CAs) to confirm administrative control over a domain before issuing digital certificates. Despite its widespread use, DV is vulnerable to various attacks, prompting the adoption of multiple vantage points to enhance security, such as the state-of-the-art DV mechanism supported by Let’s Encrypt. However, even distributed static vantage points remain susceptible to targeted attacks. In this paper we introduce ValidaTor, an HTTP-based domain validation system that leverages the Tor network to create a distributed and unpredictable set of validators. By utilizing Tor’s exit nodes, ValidaTor significantly increases the pool of available validators, providing high path diversity and resilience against strong adversaries. Our empirical evaluations demonstrate that ValidaTor can achieve the validation throughput of a commercial CA and has the potential to scale to a validation volume comparable to Let’s Encrypt, while using minimal dedicated infrastructure and only a small fraction (~0.1%) of Tor’s available bandwidth. While unpredictable selection of validators makes ValidaTor fully resistant to targeted attacks on validators, we also show the use of Tor nodes improves path diversity and thereby the resilience of DV to subversion by well-positioned ASes, reducing the number of Autonomous Systems (ASes) capable of issuing fraudulent certificates by up to 27% compared to Let’s Encrypt. Lastly, we show that the chance of subversion by malicious, colluding exit nodes is negligible (≤ 1% even with a quarter of existing exit nodes). We make the code of ValidaTor as well as the datasets and measurements publicly available for use, reproduction, and future research.

https://www.usenix.org/conference/nsdi25/presentation/friess

Track 1

11:30am EDT

From Address Blocks to Authorized Prefixes: Redesigning RPKI ROV with a Hierarchical Hashing Scheme for Fast and Memory-Efficient Validation

Wednesday April 30, 2025 11:30am - 11:50am EDT

Liberty Ballroom

Zedong Ni, Computer Network Information Center, Chinese Academy of Sciences; and School of Cyber Science & Engineering, Southeast University; Yinbo Xu, Hui Zou, and Yanbiao Li, Computer Network Information Center, Chinese Academy of Sciences; and University of Chinese Academy of Sciences; Guang Cheng, School of Cyber Science & Engineering, Southeast University; and Purple Mountain Laboratories; Gaogang Xie, Computer Network Information Center, Chinese Academy of Sciences; and University of Chinese Academy of Sciences

Route Origin Validation (ROV) with Route Origin Authorizations (ROAs), built on top of the Resource Public Key Infrastructure (RPKI), serves as the only formally standardized and production-grade defense mechanism against route hijackings in global interdomain routing infrastructures. However, the widespread adoption of RPKI has introduced escalating scalability challenges in validating high volumes of route messages against massive ROA entries.

In this paper, we attribute the performance bottleneck of existing ROV schemes to their underlying validation model, where the route is matched against rules in the form of address blocks. To overcome this bottleneck, we propose the Authorized Prefix (AP) model that enables validation at the prefix granularity, and redesign RPKI ROV based on this new model with a hierarchical hashing scheme named h²ROV. Extensive evaluations verify h²ROV's superiority over state-of-the-art approaches in IPv4, with a speedup of 1.7× ∼ 9.8× in validation and a reduction of 49.3% ∼ 86.6% in memory consumption. System emulations using real-world network topologies further demonstrate h²ROV confines its impact to routing convergence to below 8.5% during update burst events, while reducing ROV-induced delays by 30.4% ∼ 64.7% compared to existing solutions.
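For readers unfamiliar with the conventional address-block model that the abstract contrasts itself with, the following is a minimal sketch of standard route origin validation in the style of RFC 6811, written as a naive linear scan. It is not h²ROV or the Authorized Prefix model; the function name and the tuple layout of a ROA are assumptions made for illustration, and special cases such as AS 0 are ignored.

```python
import ipaddress


def validate_route(prefix, origin_asn, roas):
    """Classify a route as 'valid', 'invalid', or 'not-found'.

    roas is a list of (roa_prefix, max_length, asn) tuples (hypothetical layout).
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        block = ipaddress.ip_network(roa_prefix)
        # A ROA covers the route if the route lies inside its address block.
        if route.version != block.version or not route.subnet_of(block):
            continue
        covered = True
        # A covering ROA matches if the length and origin AS are authorized.
        if route.prefixlen <= max_length and origin_asn == roa_asn:
            return "valid"
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    roas = [("192.0.2.0/24", 24, 64500)]
    print(validate_route("192.0.2.0/24", 64500, roas))
    print(validate_route("192.0.2.0/25", 64500, roas))
    print(validate_route("198.51.100.0/24", 64500, roas))
```

The scan over every ROA makes the cost of the address-block model visible: each route is compared against covering blocks rather than looked up directly by prefix.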

https://www.usenix.org/conference/nsdi25/presentation/ni

Track 1

11:50am EDT

PreAcher: Secure and Practical Password Pre-Authentication by Content Delivery Networks

Wednesday April 30, 2025 11:50am - 12:10pm EDT

Liberty Ballroom

Shihan Lin, Duke University; Suting Chen, Northwestern University; Yunming Xiao, University of Michigan; Yanqi Gu, University of California, Irvine; Aleksandar Kuzmanovic, Northwestern University; Xiaowei Yang, Duke University

In today's Internet, websites widely rely on password authentication for user logins. However, the intensive computation required for password authentication exposes web servers to Application-layer DoS (ADoS) attacks that exploit the login interfaces. Existing solutions fail to simultaneously prevent such ADoS attacks, preserve password secrecy, and maintain good usability. In this paper, we present PreAcher, a system architecture that incorporates third-party Content Delivery Networks (CDNs) into the password authentication process and offloads the authentication workload to CDNs without divulging the passwords to them. At the core of PreAcher is a novel three-party authentication protocol that combines Oblivious Pseudorandom Function (OPRF) and Locality-Sensitive Hashing (LSH). This protocol allows CDNs to pre-authenticate users and thus filter out ADoS traffic without compromising password security. Our evaluations demonstrate that PreAcher significantly enhances the resilience of web servers against ADoS attacks and preserves password security while introducing acceptable overheads. Notably, PreAcher can be deployed immediately by websites alone today, without modifications to client software or CDN infrastructure. We release the source code of PreAcher to facilitate its deployment and future research.

https://www.usenix.org/conference/nsdi25/presentation/lin-shihan

Track 1

2:00pm EDT

ClubHeap: A High-Speed and Scalable Priority Queue for Programmable Packet Scheduling

Wednesday April 30, 2025 2:00pm - 2:20pm EDT

Liberty Ballroom

Zhikang Chen, Tsinghua University; Haoyu Song, Futurewei Technologies; Zhiyu Zhang and Yang Xu, Fudan University; Bin Liu, Tsinghua University

While PIFO is a powerful priority queue abstraction to support programmable packet scheduling in network devices, the efficient implementation of PIFO faces multiple challenges in performance and scalability. The existing solutions all fall short of certain requirements. In this paper, we propose ClubHeap to address the problem. On the one hand, we develop a novel hardware-friendly heap data structure to support faster PIFO queue operations that can schedule a flow in every clock cycle, reaching the theoretical lower bound; on the other hand, the optimized hardware architecture reduces the circuit complexity and thus enables a higher clock frequency. The end result is the best scheduling performance in its class. Combined with its inherently better scalability and flexibility, ClubHeap is an ideal solution to be built into programmable switches and SmartNICs to support various scheduling algorithms. We build an FPGA-based hardware prototype and conduct a thorough evaluation by comparing ClubHeap with the other state-of-the-art solutions. ClubHeap also allows graceful trade-offs between throughput and resource consumption through parameter adjustments, making it adaptable to different target devices.

https://www.usenix.org/conference/nsdi25/presentation/chen-zhikang

Track 1

2:20pm EDT

Self-Clocked Round-Robin Packet Scheduling

Wednesday April 30, 2025 2:20pm - 2:40pm EDT

Liberty Ballroom

Erfan Sharafzadeh, Johns Hopkins University and Hewlett Packard Labs; Raymond Matson, University of California Riverside; Jean Tourrilhes and Puneet Sharma, Hewlett Packard Labs; Soudeh Ghorbani, Johns Hopkins University and Meta

Deficit Round Robin (DRR) is the de facto fair packet scheduler in the Internet due to its superior fairness and scalability. We show that DRR can perform poorly due to its assumptions about packet size distributions and traffic bursts. Concretely, DRR performs best if (1) packet size distributions are known in advance; its optimal performance depends on tuning a parameter based on the largest packet, and (2) all bursts are long and create backlogged queues. We show that neither of these assumptions holds in today's Internet: packet size distributions are varied and dynamic, complicating the tuning of DRR. Plus, Internet traffic consists of many short, latency-sensitive flows, creating small bursts. These flows can experience high latency under DRR as it serves a potentially large number of flows in a round-robin fashion.

To address these shortcomings while retaining the fairness and scalability of DRR, we introduce Self-Clocked Round-Robin Scheduling (SCRR), a parameter-less, low-latency, and scalable packet scheduler that boosts short latency-sensitive flows through careful adjustments to their virtual times without violating their fair share guarantees. We evaluate SCRR using theoretical models and a Linux implementation on a physical testbed. Our results demonstrate that while performing on an equal footing with DRR on achieving flow fairness, SCRR reduces the average CPU overhead by 23% compared to DRR with a small quantum while improving the application latency by 71% compared to DRR with a large quantum.
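As background for the DRR baseline discussed in this abstract, here is a minimal sketch of textbook Deficit Round Robin, not of SCRR. Each backlogged flow receives `quantum` bytes of credit per round and sends packets while its credit lasts. The function name and the input layout are assumptions made for illustration.

```python
from collections import deque


def drr_schedule(queues, quantum):
    """Return the service order [(flow, packet_size), ...] under textbook DRR.

    queues maps a flow id to a list of packet sizes in bytes (hypothetical layout).
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    backlog = {f: deque(p) for f, p in queues.items() if p}
    deficit = {f: 0 for f in backlog}
    active = deque(backlog)  # round-robin list of backlogged flows
    order = []
    while active:
        flow = active.popleft()
        deficit[flow] += quantum
        q = backlog[flow]
        # Send head-of-line packets while the accumulated credit covers them.
        while q and q[0] <= deficit[flow]:
            size = q.popleft()
            deficit[flow] -= size
            order.append((flow, size))
        if q:
            active.append(flow)  # still backlogged: keep leftover credit
        else:
            deficit[flow] = 0  # idle flows do not bank credit
    return order


if __name__ == "__main__":
    print(drr_schedule({"a": [300, 300], "b": [100]}, quantum=200))
```

With a quantum smaller than the largest packet, flow `a` in the usage example needs more than one round before its first packet is sent, which hints at why the quantum is usually tuned to the largest packet size.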

https://www.usenix.org/conference/nsdi25/presentation/sharafzadeh

Track 1

2:40pm EDT

Everything Matters in Programmable Packet Scheduling

Wednesday April 30, 2025 2:40pm - 3:00pm EDT

Liberty Ballroom

Albert Gran Alcoz, ETH Zürich; Balázs Vass, BME-TMIT; Pooria Namyar, USC; Behnaz Arzani, Microsoft Research; Gábor Rétvári, BME-TMIT; Laurent Vanbever, ETH Zürich

Operators can deploy any scheduler they desire on existing switches through programmable packet schedulers: they tag packets with ranks (which indicate their priority) and schedule them in the order of these ranks. The ideal programmable scheduler is the Push-In First-Out (PIFO) queue, which schedules packets in a perfectly sorted order by "pushing" packets into any position of the queue based on their ranks. However, it is hard to implement PIFO queues in hardware due to their need to sort packets at line rate (based on their ranks).

Recent proposals approximate PIFO behaviors on existing data-planes. While promising, they fail to simultaneously capture both of the necessary behaviors of PIFO queues: their scheduling behavior and admission control. We introduce PACKS, an approximate PIFO scheduler that addresses this problem. PACKS runs on top of a set of priority queues and uses packet-rank information and queue-occupancy levels during enqueue to determine whether to admit each incoming packet and to which queue it should be mapped.

We fully implement PACKS in P4 and evaluate it on real workloads. We show that PACKS approximates PIFO better than state-of-the-art approaches. Specifically, PACKS reduces the rank inversions by up to 7× and 15× with respect to SP-PIFO and AIFO, and the number of packet drops by up to 60% compared to SP-PIFO. Under pFabric ranks, PACKS reduces the mean FCT across small flows by up to 33% and 2.6×, compared to SP-PIFO and AIFO. We also show that PACKS runs at line rate on existing hardware (Intel Tofino).
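The ideal PIFO behavior that PACKS approximates can be sketched in software as a sorted list: a packet is pushed into the position given by its rank, and dequeue always takes the head. This is only an illustration of the abstraction, not of PACKS or of any hardware design. The class name is hypothetical, and it is an assumption that a lower rank means earlier service, with ties served in arrival order.

```python
import bisect
import itertools


class PIFO:
    """Software model of a Push-In First-Out queue (illustrative only)."""

    def __init__(self):
        self._items = []  # kept sorted by (rank, arrival sequence)
        self._seq = itertools.count()

    def push(self, rank, packet):
        # The sequence number keeps equal-rank packets in arrival order
        # and ensures the packet payload itself is never compared.
        bisect.insort(self._items, (rank, next(self._seq), packet))

    def pop(self):
        rank, _, packet = self._items.pop(0)  # head holds the lowest rank
        return rank, packet

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    q = PIFO()
    for rank, pkt in [(5, "a"), (1, "b"), (5, "c"), (3, "d")]:
        q.push(rank, pkt)
    print([q.pop() for _ in range(len(q))])
```

The sorted insert is trivial in software but is exactly the operation that is hard to perform at line rate, which is why approximations over a few priority queues are attractive.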

https://www.usenix.org/conference/nsdi25/presentation/alcoz

Track 1

3:00pm EDT

When P4 Meets Run-to-completion Architecture

Wednesday April 30, 2025 3:00pm - 3:20pm EDT

Liberty Ballroom

Hao Zheng, State Key Laboratory for Novel Software Technology, Nanjing University, China; Xin Yan, Huawei, China; Wenbo Li, Jiaqi Zheng, and Xiaoliang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, China; Qingqing Zhao, Luyou He, Xiaofei Lai, Feng Gao, and Fuguang Huang, Huawei, China; Wanchun Dou, Guihai Chen, and Chen Tian, State Key Laboratory for Novel Software Technology, Nanjing University, China

P4 programmable data planes have significantly accelerated the evolution of various network technologies. Although the P4 language has gained wide acceptance, its further development encounters two obstacles: limited programmability and the discontinuation of the next-generation Tofino chip. As a hardware manufacturer, we try to address the above dilemmas by opening the P4 programmability of our run-to-completion (RTC) chips. At present, there is no publicly available experience in this field. We introduce P4RTC, a comprehensive consolidation of our experiences applying the P4 language to RTC architecture. P4RTC introduces a new P4 architecture model and a set of beneficial extern constructs to fully leverage the RTC architecture’s programmability. In addition, we share the insights we have gained from designing and implementing compilers. We also provide a performance model to facilitate profiling P4RTC’s performance on user-customized P4 code. We prototype P4RTC on an RTC chip with 1.2 Tbps bandwidth. Case-oriented evaluation demonstrates that P4RTC can enhance P4 programmability and reduce the burdens of RTC development. The performance model can provide substantial insights into optimizing P4RTC programs.

https://www.usenix.org/conference/nsdi25/presentation/zheng-hao

Track 1

3:50pm EDT

Mutant: Learning Congestion Control from Existing Protocols via Online Reinforcement Learning

Wednesday April 30, 2025 3:50pm - 4:10pm EDT

Liberty Ballroom

Lorenzo Pappone, Computer Science Department, Saint Louis University; Alessio Sacco, DAUIN, Politecnico di Torino; Flavio Esposito, Computer Science Department, Saint Louis University

Learning how to control congestion remains a challenge despite years of progress. Existing congestion control protocols have demonstrated efficacy within specific network conditions, inevitably behaving suboptimally or poorly in others. Machine learning solutions to congestion control have been proposed, though relying on extensive training and specific network configurations. In this paper, we loosen such dependencies by proposing Mutant, an online reinforcement learning algorithm for congestion control that adapts to the behavior of the best-performing schemes, outperforming them in most network conditions. Design challenges included determining the best protocols to learn from, given a network scenario, and creating a system able to evolve to accommodate future protocols with minimal changes. Our evaluation on real-world and emulated scenarios shows that Mutant achieves lower delays and higher throughput than prior learning-based schemes while maintaining fairness by exhibiting negligible harm to competing flows, making it robust across diverse and dynamic network conditions.

https://www.usenix.org/conference/nsdi25/presentation/pappone

Track 1

4:10pm EDT

CATO: End-to-End Optimization of ML-Based Traffic Analysis Pipelines

Wednesday April 30, 2025 4:10pm - 4:30pm EDT

Liberty Ballroom

Gerry Wan, Stanford University; Shinan Liu, University of Chicago; Francesco Bronzino, ENS Lyon; Nick Feamster, University of Chicago; Zakir Durumeric, Stanford University

Machine learning has shown tremendous potential for improving the capabilities of network traffic analysis applications, often outperforming simpler rule-based heuristics. However, ML-based solutions remain difficult to deploy in practice. Many existing approaches only optimize the predictive performance of their models, overlooking the practical challenges of running them against network traffic in real time. This is especially problematic in the domain of traffic analysis, where the efficiency of the serving pipeline is a critical factor in determining the usability of a model. In this work, we introduce CATO, a framework that addresses this problem by jointly optimizing the predictive performance and the associated systems costs of the serving pipeline. CATO leverages recent advances in multi-objective Bayesian optimization to efficiently identify Pareto-optimal configurations, and automatically compiles end-to-end optimized serving pipelines that can be deployed in real networks. Our evaluations show that compared to popular feature optimization techniques, CATO can provide up to 3600× lower inference latency and 3.7× higher zero-loss throughput while simultaneously achieving better model performance.
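To make the term "Pareto-optimal configurations" concrete, the sketch below filters a small set of candidates down to those not dominated on two objectives, systems cost (lower is better) and predictive score (higher is better). It is a brute-force illustration of the definition only; CATO's search uses multi-objective Bayesian optimization, which is not shown here. The function name, tuple layout, and sample numbers are all hypothetical.

```python
def pareto_front(configs):
    """Return names of non-dominated configurations.

    configs is a list of (name, cost, score); lower cost and higher score win.
    """
    front = []
    for name, cost, score in configs:
        # Dominated: some other point is no worse on both objectives
        # and strictly better on at least one.
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in configs
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    candidates = [
        ("A", 1.0, 0.80),
        ("B", 2.0, 0.90),
        ("C", 2.5, 0.85),  # costs more than B and scores lower
        ("D", 0.5, 0.60),
    ]
    print(pareto_front(candidates))
```

Every configuration on the front represents a different trade-off, so the final choice among them still depends on the deployment's latency and accuracy requirements.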

https://www.usenix.org/conference/nsdi25/presentation/wan-gerry

Track 1

4:30pm EDT

Resolving Packets from Counters: Enabling Multi-scale Network Traffic Super Resolution via Composable Large Traffic Model

Wednesday April 30, 2025 4:30pm - 4:50pm EDT

Liberty Ballroom

Xizheng Wang, Tsinghua University and Zhongguancun Laboratory; Libin Liu and Li Chen, Zhongguancun Laboratory; Dan Li, Tsinghua University; Yukai Miao and Yu Bai, Zhongguancun Laboratory

Realistic fine-grained traffic traces are valuable to numerous applications in both academia and industry. However, obtaining them directly from devices is significantly challenging, while coarse-grained counters are readily available on almost all network devices. No existing work can restore fine-grained traffic traces from counters, a task we call network traffic super-resolution (TSR). To this end, we propose ZOOMSYNTH, the first TSR system that can achieve packet-level trace synthesis with counter traces as input. Following the basic structure of the TSR task, we design the Granular Traffic Transformer (GTT) model and the Composable Large Traffic Model (CLTM). CLTM is a tree of GTT models, and the GTT models in each layer perform upscaling on a particular granularity, which allows each GTT model to capture the traffic characteristics at this resolution. Using CLTM, we synthesize fine-grained traces from counters. We also leverage a rule-following model to comprehend counter rules (e.g., ACLs) when available, guiding the generation of fine-grained traces. We implement ZOOMSYNTH and perform extensive evaluations. Results show that, with only second-level counter traces, ZOOMSYNTH achieves synthesis quality comparable to existing solutions that take packet-level traces as input. CLTM can also be fine-tuned to support downstream tasks. For example, ZOOMSYNTH with fine-tuned CLTM outperforms the existing solution by 27.5% and 9.8% in anomaly detection and service recognition tasks, respectively. To promote future research, we release the pre-trained CLTM-1.8B model weights along with its source code.

https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-resolving

Track 1

4:50pm EDT

BFTBrain: Adaptive BFT Consensus with Reinforcement Learning

Wednesday April 30, 2025 4:50pm - 5:10pm EDT

Liberty Ballroom

Chenyuan Wu and Haoyun Qin, University of Pennsylvania; Mohammad Javad Amiri, Stony Brook University; Boon Thau Loo, University of Pennsylvania; Dahlia Malkhi, UC Santa Barbara; Ryan Marcus, University of Pennsylvania

This paper presents BFTBrain, a reinforcement learning (RL) based Byzantine fault-tolerant (BFT) system that provides significant operational benefits: it is a plug-and-play system suitable for a broad set of hardware and network configurations, and it adjusts effectively in real time to changing fault scenarios and workloads. BFTBrain adapts to system conditions and application needs by switching between a set of BFT protocols in real time. Two main advances contribute to BFTBrain’s agility and performance. First, BFTBrain is based on a systematic, thorough modeling of metrics that correlate the performance of the studied BFT protocols with varying fault scenarios and workloads. These metrics are fed as features to BFTBrain’s RL engine in order to choose the best-performing BFT protocols in real time. Second, BFTBrain coordinates RL in a decentralized manner that is resilient to adversarial data pollution, where nodes share local metering values and reach the same learning output by consensus. As a result, in addition to providing significant operational benefits, BFTBrain improves throughput over fixed protocols by 18% to 119% under dynamic conditions and outperforms state-of-the-art learning-based approaches by 44% to 154%.

https://www.usenix.org/conference/nsdi25/presentation/wu-chenyuan

Track 1