YouTube, 18 Mar 2025
31m

First Principles: Inside Zettascale OCI Superclusters for Next-Gen AI

Oracle

This episode explores the engineering behind Oracle Cloud Infrastructure (OCI) Zettascale Superclusters, which are designed for next-generation AI workloads. With demand for large-scale AI processing growing, the discussion centers on OCI's purpose-built GenAI network, which uses RDMA for high throughput and low latency. The conversation then turns to the challenge of scaling RDMA to clusters of more than 131,000 GPUs, highlighting innovations such as routable RoCE (RDMA over Converged Ethernet) and congestion control that deliver 52 petabits per second of non-blocking bandwidth and latencies as low as 2 microseconds. Advanced traffic forwarding and a network locality service are also discussed as ways to optimize performance. The episode goes on to address the resilience challenges specific to GenAI workloads, emphasizing link stabilization techniques that limit the impact of transient link flaps, which matter even in a non-blocking network. Finally, it explains how OCI manages bandwidth collisions with collectives-aware load balancing, which improves efficiency and reduces congestion. Together, these techniques are presented as the basis for a robust, scalable infrastructure for AI and ML applications.
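The headline bandwidth figure can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the exact GPU count of 131,072 and the 400 Gb/s of network bandwidth per GPU are assumptions chosen to be consistent with the summary's "over 131,000 GPUs" and "52 petabits" figures, not numbers stated in this summary, and the function name is hypothetical.

```python
def aggregate_bandwidth_pbps(gpu_count: int, per_gpu_gbps: float) -> float:
    """Total injection bandwidth in petabits per second (1 Pb = 1,000,000 Gb)."""
    return gpu_count * per_gpu_gbps / 1_000_000


# Assumed values, not taken from the episode summary.
GPU_COUNT = 131_072      # assumption: 2**17 GPUs ("over 131,000")
PER_GPU_GBPS = 400.0     # assumption: 400 Gb/s of network bandwidth per GPU

total_pbps = aggregate_bandwidth_pbps(GPU_COUNT, PER_GPU_GBPS)
print(f"{total_pbps:.1f} Pb/s")
```

Under these assumptions the total comes to roughly 52.4 Pb/s, which lines up with the 52 petabits quoted above. In a non-blocking fabric, that aggregate is available to all GPUs at once rather than being oversubscribed.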

Outlines

Part 1: Introduction to OCI Superclusters

Part 2: Network Innovations for GenAI

Part 3: Network Resilience and Mitigation

Part 4: Bandwidth Management and Conclusion
