Loading…
2025 OCP APAC Summit
Tuesday August 5, 2025 1:40pm - 2:00pm PDT
FBOSS is Meta’s own Software Stack for managing Network Switches deployed in Meta’s data centers. It is one of the largest services in Meta in terms of the number of instances deployed.

Network Traffic in AI Fabric presents unique challenges such as “elephant flows” (a small number of extremely large, continuous flows), and low entropy (limited variation in flow characteristics, increasing likelihood of hash collisions).

At OCP 2024, we showcased how we evolved FBOSS to tackle these challenges. This solution is capable of building non-blocking clusters for up to 4K GPUs. However, generative AI use cases demand significantly larger non-blocking clusters. This can be solved by interconnecting multiple 4K GPU clusters into a single, larger cluster using traditional Routing and ECMP. In this design, intra-cluster traffic benefits from non-blocking I/O, but inter-cluster traffic continues to suffer from poor network performance due to the aforementioned elephant flows and low entropy.

In this talk, we will share our journey evolving FBOSS for generative AI workloads. We will discuss the hierarchical design that enables us to build significantly larger non-blocking clusters, the unique challenges we encountered in scaling both the dataplane and control plane, and the solutions we developed to overcome them. Additionally, we will highlight the SAI enhancements that were instrumental in adapting FBOSS to support the demands of generative AI.
Tuesday August 5, 2025 1:40pm - 2:00pm PDT
TaiNEX2 - 701 F

Sign up or log in to save this to your schedule, view media, leave feedback and see who's attending!

Share Modal

Share this link via

Or copy link