Gaya Nagarajan discusses the transformative impact of AI on infrastructure and networking, highlighting the rapid growth of AI products and the increasing relevance of infrastructure. The talk covers Meta's internal AI tools such as Devmate, the shift to large AI clusters built on protocols like RoCE (RDMA over Converged Ethernet), and the challenges posed by Mixture of Experts (MoE) models and reasoning models. Nagarajan emphasizes the need for optionality, scale, and bold investments, citing the 2Africa and Waterworth subsea cable projects as well as the construction of massive data center clusters like Prometheus and Hyperion. The talk also addresses the evolving rack ecosystem, the importance of vertical integration in physical infrastructure design, and the need to stay nimble and flexible in the face of rapid change, including unconventional approaches such as housing data centers in "tents". It concludes with a focus on tuning infrastructure for performance, managing the challenges of scale-up, distance, fault tolerance, and interoperability, and blending HPC principles with distributed systems to support the ever-shifting AI landscape.