David Becker and Jag Brar from Oracle Cloud Infrastructure (OCI) discuss the challenges and solutions involved in building large-scale RDMA (Remote Direct Memory Access) networks for cloud computing. They cover OCI's history, starting with HPC and Exadata workloads, and how OCI evolved to support GPU-intensive AI and machine learning applications. They address the complexities of multi-tenancy and network virtualization, and the need for low latency and high throughput. They also discuss the technical aspects of their network, including the use of RoCE (RDMA over Converged Ethernet), congestion control mechanisms, and the challenges of optical link stability and traffic distribution. Throughout, they highlight the innovations required to scale RDMA networks to hundreds of thousands of nodes while maintaining performance and isolation.