Ron He and Ankur Singh discuss the challenges of scaling AI networks, particularly with the advent of Gen-AI, and introduce Disaggregated Scheduled Fabric (DSF) as a solution. They explain how traditional IP fabrics struggle with elephant flows, low entropy, and suboptimal capacity utilization, which motivated the development of DSF. DSF disaggregates line cards to build what is logically a single, much larger switch, using credit-based congestion control and packet fragmentation to make optimal use of fabric links. They detail the DSF architecture for Gen-AI applications, including AI zones, scaling units, and a dual-stage spine, and show how it scales to 18K GPUs across multiple buildings. They also address failure management through Input Balanced Mode, which keeps traffic balanced across the network during link failures. Finally, they share performance results and future plans, including interconnecting multiple DC regions and developing HyperPorts.
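The contrast between per-flow hashing and fragmenting packets across fabric links can be illustrated with a small sketch. It is not taken from the talk: the function names, the 256-byte cell size, and the round-robin spraying are all assumptions chosen for illustration, and a real DSF schedules cells with credits rather than simple round-robin.

```python
import zlib


def ecmp_link_load(flows, num_links):
    """Per-flow hashing: every byte of a flow lands on a single link."""
    load = [0] * num_links
    for flow_id, size in flows:
        # Hypothetical hash on the flow identifier; real switches hash header fields.
        link = zlib.crc32(flow_id.encode()) % num_links
        load[link] += size
    return load


def sprayed_link_load(flows, num_links, cell_size=256):
    """Fragment each flow into cells and spread them round-robin over all links."""
    load = [0] * num_links
    next_link = 0
    for _flow_id, size in flows:
        remaining = size
        while remaining > 0:
            cell = min(cell_size, remaining)
            load[next_link] += cell
            next_link = (next_link + 1) % num_links
            remaining -= cell
    return load


# Four elephant flows (low entropy) over eight fabric links.
flows = [(f"gpu{i}->gpu{i + 4}", 256 * 4000) for i in range(4)]
hashed = ecmp_link_load(flows, num_links=8)
sprayed = sprayed_link_load(flows, num_links=8)
print("per-flow hashing:", hashed)
print("cell spraying:   ", sprayed)
```

With only four flows, per-flow hashing can occupy at most four of the eight links regardless of how the hash falls, while spraying spreads the same bytes over every link.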