The podcast explores how Meta scales its AI networks with Disaggregated Scheduled Fabric (DSF) to meet the demands of GenAI. Traditional IP fabrics suffer from elephant flows, low entropy, and suboptimal capacity utilization, and these problems led to the development of DSF. DSF disaggregates line cards and fabric cards into individual devices that together form one logically large switch. The discussion covers the architecture of DSF AI zones and how they support traffic patterns within scaling units and across larger clusters, scaling up to data center regions. Input Balance Mode is presented as a way to handle link failures by keeping input and output capacity balanced. Performance results show that DSF handles both bandwidth-intensive collectives and real training jobs effectively. Future developments include connecting multiple data center regions and hyperports.
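The idea of balancing input and output capacity after a link failure can be illustrated with a toy model. This is only a sketch under stated assumptions, not Meta's implementation: the function name `balanced_input_links`, its parameters, and the rule "keep input capacity at or below the surviving output capacity" are all hypothetical simplifications chosen for illustration.

```python
def balanced_input_links(input_links_up: int,
                         output_links_up: int,
                         input_gbps: int = 400,
                         output_gbps: int = 400) -> int:
    """Return how many input links a device could keep active so that
    its total input capacity does not exceed its remaining output capacity.

    Hypothetical toy model: every link on a side has the same speed, and
    balancing is done by deactivating whole input links.
    """
    if min(input_links_up, output_links_up) < 0:
        raise ValueError("link counts must be non-negative")
    if input_gbps <= 0 or output_gbps <= 0:
        raise ValueError("link speeds must be positive")

    # Capacity that can still leave the device after failures.
    output_capacity = output_links_up * output_gbps

    # Largest number of input links whose combined speed fits in that capacity.
    max_inputs = output_capacity // input_gbps

    # Never report more input links than are actually up.
    return min(input_links_up, max_inputs)


if __name__ == "__main__":
    # Healthy device: 16 links in, 16 links out, nothing to reduce.
    print(balanced_input_links(16, 16))
    # Two output links failed: the model keeps only 14 input links active.
    print(balanced_input_links(16, 14))
```

In this simplified view, a device that loses output links admits less traffic instead of accepting more than it can forward, so the imbalance does not turn into congestion inside the fabric.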