Monday April 28, 2025 2:20pm - 2:40pm EDT
Hamid Hajabdolali Bazzaz, Yingjie Bi, and Weiwu Pang, Google; Minlan Yu, Harvard University; Ramesh Govindan, University of Southern California; Neal Cardwell, Nandita Dukkipati, Meng-Jung Tsai, Chris DeForeest, and Yuxue Jin, Google; Charles Carver, Columbia University; Jan Kopański, Liqun Cheng, and Amin Vahdat, Google


Datacenter network hotspots, defined as links with persistently high utilization, can lead to performance bottlenecks. In this work, we study hotspots in Google’s datacenter networks. We find that these hotspots occur most frequently at ToR switches and can persist for hours. They are caused mainly by bandwidth demand-supply imbalance, largely due to high demand from network-intensive services, or demand exceeding available bandwidth when compute/storage upgrades outpace ToR bandwidth upgrades. Compounding this issue is bandwidth-independent task/data placement by datacenter compute and storage schedulers. We quantify the performance impact of hotspots, and find that they can degrade the end-to-end latency of some distributed applications by over 2× relative to low utilization levels. Finally, we describe simple improvements we deployed. In our cluster scheduler, adding hotspot-aware task placement reduced the number of hot ToRs by 90%; in our distributed file system, adding hotspot-aware data placement reduced p95 network latency by more than 50%. While congestion control, load balancing, and traffic engineering can efficiently utilize paths for a fixed placement, we find hotspot-aware placement – placing tasks and data under ToRs with higher available bandwidth – is crucial for achieving consistently good performance.
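
To make the idea of hotspot-aware placement concrete, here is a minimal Python sketch of choosing a ToR by available bandwidth. It is an illustration only and not the scheduler described in the talk: the `ToR` class, the `place_task` function, the Gbps units, and the 0.8 "hot" utilization threshold are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToR:
    """Hypothetical model of a top-of-rack switch's uplink bandwidth."""
    name: str
    capacity_gbps: float
    used_gbps: float

    @property
    def available_gbps(self) -> float:
        return self.capacity_gbps - self.used_gbps


def place_task(tors: List[ToR], demand_gbps: float,
               hot_threshold: float = 0.8) -> Optional[ToR]:
    """Place a task under the ToR with the most available bandwidth.

    A ToR is only a candidate if adding the task keeps its utilization
    at or below hot_threshold (an assumed cutoff, not from the source).
    Returns the chosen ToR, or None if every ToR would become hot.
    """
    candidates = [
        t for t in tors
        if (t.used_gbps + demand_gbps) / t.capacity_gbps <= hot_threshold
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda t: t.available_gbps)
    best.used_gbps += demand_gbps  # account for the newly placed task
    return best
```

A bandwidth-independent scheduler would pick among ToRs without looking at `available_gbps`; the sketch differs only in ranking candidates by headroom and refusing placements that would push a ToR past the assumed threshold.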


https://www.usenix.org/conference/nsdi25/presentation/bazzaz
Liberty Ballroom
