YouTube · 23 Oct 2024
20m

5781 Evolving FBOSS for the Next Gen AI Fabric


Open Compute Project

This episode covers how Meta evolved FBOSS, its network switch software stack, to meet the demands of next-generation AI workloads. The approach, Disaggregated Scheduled Fabric (DSF), combines credit-based scheduling with cell spraying across fabric links to achieve near-optimal load balancing and higher usable bandwidth. Both the DSF SDK and FBOSS were upgraded, including a new control plane for neighbor resolution and improved failure handling across the distributed system. Performance tests showed gains in bandwidth-heavy scenarios and graceful handling of failures, laying the groundwork for larger, more scalable data center fabrics.
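To make the two mechanisms named in the summary more concrete, here is a minimal Python sketch. It is a toy model, not FBOSS or DSF SDK code: the function names, the 256-byte cell size, and the round-robin and first-come grant policies are all assumptions chosen for illustration. It contrasts spraying fixed-size cells across links with keeping each packet on a single link, and shows an egress port granting credits only up to what it can drain.

```python
from itertools import cycle

CELL_BYTES = 256  # hypothetical cell size, chosen only for illustration


def spray_cells(packet_sizes, num_links, cell_bytes=CELL_BYTES):
    """Split each packet into cells and place the cells round-robin on links."""
    load = [0] * num_links
    links = cycle(range(num_links))
    for size in packet_sizes:
        remaining = size
        while remaining > 0:
            chunk = min(cell_bytes, remaining)  # last cell may be short
            load[next(links)] += chunk
            remaining -= chunk
    return load


def pin_packets(packet_sizes, num_links):
    """Baseline for comparison: each whole packet stays on one link."""
    load = [0] * num_links
    for i, size in enumerate(packet_sizes):
        load[i % num_links] += size
    return load


def grant_credits(requested_bytes, egress_capacity_bytes):
    """Toy credit scheduler: grant requests in order, never exceeding
    what the egress port can drain in this scheduling round."""
    grants = []
    available = egress_capacity_bytes
    for req in requested_bytes:
        granted = min(req, available)
        grants.append(granted)
        available -= granted
    return grants


if __name__ == "__main__":
    packets = [4096, 512, 512, 512]
    print(spray_cells(packets, 4))   # [1536, 1536, 1280, 1280]
    print(pin_packets(packets, 4))   # [4096, 512, 512, 512]
    print(grant_credits([3000, 3000, 3000], 5000))  # [3000, 2000, 0]
```

In this example the sprayed loads differ by at most one cell, while the pinned baseline leaves one link carrying most of the bytes. The credit function shows why a scheduled fabric avoids oversubscribing an egress port: senders transmit only what has been granted.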
