Rohit Puri and Hany Morsy from Meta discuss the evolution of Meta's AI network infrastructure, focusing on the challenges and solutions encountered while scaling up to support generative AI. They detail the transition from single AI zones to multi-building clusters, highlighting the architecture of the 24k and 100k+ GPU clusters, and address issues like network congestion, latency, and traffic management. They also touch on the upcoming Prometheus project, a super cluster spanning a metropolitan area, and the challenges of training models over large distances with heterogeneous hardware.