First Principles: Inside Zettascale OCI Superclusters for Next-Gen AI

This episode explores the engineering behind Oracle Cloud Infrastructure (OCI) Zettascale Superclusters, which are designed for next-generation AI workloads. With demand for large-scale AI processing growing, the discussion centers on OCI's purpose-built Gen-AI network, which uses RDMA for high throughput and low latency. The conversation covers the challenges of scaling RDMA to clusters of more than 131,000 GPUs and highlights innovations such as Routable RoCE (RoCE v2) and congestion control, which deliver 52 petabits per second of non-blocking bandwidth and latencies as low as 2 microseconds. Advanced traffic forwarding and a network locality service are discussed as ways to optimize performance. The episode also addresses the resilience challenges specific to Gen-AI workloads, emphasizing link stabilization techniques that mitigate the impact of transient link flaps, even in a non-blocking network. Finally, the discussion explains how OCI manages bandwidth collisions using collectives-aware load balancing, improving efficiency and reducing congestion. Together, these techniques show how OCI aims to provide a robust and scalable infrastructure for AI and ML applications.
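As a rough sanity check on the bandwidth figure above, the short sketch below multiplies a GPU count by a per-GPU port speed. Both inputs are assumptions for illustration and are not stated in this summary: a cluster of exactly 131,072 GPUs (2^17) and one 400 Gb/s RDMA port per GPU. The function name `aggregate_pbps` is hypothetical.

```python
# Back-of-the-envelope check of aggregate cluster bandwidth.
# Assumptions (not stated in the episode summary):
#   - the cluster has exactly 131,072 GPUs (2**17)
#   - each GPU has one 400 Gb/s RDMA network port
GPUS = 131_072
PORT_GBPS = 400  # hypothetical per-GPU line rate, in gigabits per second


def aggregate_pbps(gpus: int, port_gbps: int) -> float:
    """Total injection bandwidth in petabits per second (1 Pb = 1e6 Gb)."""
    return gpus * port_gbps / 1_000_000


if __name__ == "__main__":
    print(f"{aggregate_pbps(GPUS, PORT_GBPS):.1f} Pb/s")
```

Under those assumptions the total comes to roughly 52.4 Pb/s, which is consistent with the 52-petabit figure quoted in the summary.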

Outline

Part 1: Introduction to OCI Superclusters

Part 2: Network Innovations for Gen-AI

Part 3: Network Resilience and Mitigation

Part 4: Bandwidth Management and Conclusion

Oracle

Part 1: Introduction to OCI Superclusters

00:00 Introduction: Zettascale OCI Superclusters and AI Innovations

01:18 Building the Cluster Network for Zettascale GPU Clusters

04:02 RDMA at Scale: RoCE v2 and Addressing Challenges

Part 2: Network Innovations for Gen-AI

06:30 Gen-AI Specific Requirements and Network Innovations

08:43 Achieving Ultra-Low Network Latency: 2 Microseconds

Part 3: Network Resilience and Mitigation

13:09 Network Resilience for Gen-AI Workloads

15:16 Mitigating the Impact of Network Disruptions

21:14 Causes and Mitigation of Transient Link Flaps

Part 4: Bandwidth Management and Conclusion

24:51 Managing Bandwidth Congestion in Non-Blocking Networks