First Principles: superclusters with RDMA—Ultra-high performance at massive scale

This episode explores the architecture of the Oracle Cloud Infrastructure (OCI) RDMA supercluster, a network built for GPU workloads at very large scale. Customer demand for large GPU workloads, potentially spanning tens of thousands of GPUs, led OCI to build the supercluster on a three-tier Clos network. Latency grows in a system this large, and the design addresses that with lossless networking, enhanced buffering, and intelligent congestion notification. Worst-case latency might reach 20 microseconds, which is still significantly lower than in typical cloud networks, and the system places workloads to minimize latency where possible. The discussion also covers "network locality hints," a service that tells customers where their GPUs sit in the network so they can arrange their GPU topology to reduce latency. This lets customers balance scale against latency, so even latency-sensitive workloads can benefit from the supercluster. Overall, the RDMA supercluster offers both scalability and performance optimization for diverse workloads.
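The idea behind locality-aware placement can be illustrated with a small Python sketch. It is not OCI's API or placement algorithm: the `block` and `leaf` labels, the `Host` type, and the `pick_hosts` helper are hypothetical names invented for this example. The only fact it relies on is a general property of a folded three-tier Clos fabric: two hosts under the same bottom-tier switch are one switch apart, hosts in the same block are three switches apart, and hosts in different blocks are five switches apart.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    block: str  # hypothetical locality hint: group of switches below the top tier
    leaf: str   # hypothetical locality hint: bottom-tier switch the host attaches to


def switch_hops(a: Host, b: Host) -> int:
    """Switches traversed between two hosts in a folded three-tier Clos fabric."""
    if a.block == b.block and a.leaf == b.leaf:
        return 1  # leaf only
    if a.block == b.block:
        return 3  # leaf, middle tier, leaf
    return 5      # leaf, middle tier, top tier, middle tier, leaf


def pick_hosts(hosts: list[Host], count: int) -> list[Host]:
    """Pick `count` hosts, preferring a single block and adjacent leaves."""
    if count > len(hosts):
        raise ValueError("not enough hosts available")
    by_block: dict[str, list[Host]] = defaultdict(list)
    for host in hosts:
        by_block[host.block].append(host)
    for members in by_block.values():
        # Keep hosts on the same leaf next to each other.
        members.sort(key=lambda h: (h.leaf, h.name))
    # Best case: the smallest block that can hold the whole job.
    fitting = [m for m in by_block.values() if len(m) >= count]
    if fitting:
        return min(fitting, key=len)[:count]
    # Otherwise spill across blocks, largest first, to touch as few as possible.
    chosen: list[Host] = []
    for members in sorted(by_block.values(), key=len, reverse=True):
        chosen.extend(members[: count - len(chosen)])
        if len(chosen) == count:
            break
    return chosen


def worst_case_hops(hosts: list[Host]) -> int:
    """Largest switch-hop distance between any two distinct hosts."""
    return max((switch_hops(a, b) for a in hosts for b in hosts if a != b), default=0)
```

A small job that fits under one leaf or one block never crosses the top tier, so only jobs too large for a single block pay the full five-switch path. That is the trade-off between scale and latency described in the summary.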

Outlines

Oracle

00:14 Introduction to RDMA and OCI's Investment

03:07 Introducing the RDMA Supercluster: High-Level Overview

04:13 RDMA Supercluster Architecture: The Three-Tier Clos Fabric

07:07 Addressing Latency Concerns for Different Workloads

08:39 Optimizations for GPU Workloads Spanning Multiple Blocks