Natalia Gimelshein and Ke Wen introduce Symmetric Memory in PyTorch, a new paradigm for programming distributed AI that treats a GPU cluster as one large GPU. They cover its history, the challenges it addresses, and the capabilities it unlocks, including low-latency communication, in-kernel initiation, and arbitrary communication patterns. They walk through examples of using Symmetric Memory in PyTorch, such as one-shot all-reduce, async tensor parallelism, and mixture-of-experts models, highlighting performance improvements and new possibilities. They also explain how to program Symmetric Memory with the Triton DSL, how to scale it out to multiple nodes with RDMA and IBGDA, and future plans, including support for new hardware, new operations, compiler integration with torch.compile, and fault tolerance.
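As a rough illustration of the kind of usage discussed, the sketch below allocates a tensor from PyTorch's symmetric-memory allocator, performs a rendezvous across ranks, and calls a one-shot all-reduce. The specific names (torch.distributed._symmetric_memory, symm_mem.empty, symm_mem.rendezvous, torch.ops.symm_mem.one_shot_all_reduce, and the group_name attribute) are assumptions based on recent PyTorch nightlies and may differ between versions; consult the PyTorch documentation for the exact API.

```python
# Hedged sketch of a symmetric-memory one-shot all-reduce in PyTorch.
# Module and op names are assumptions (recent nightlies); not verbatim from the talk.
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem


def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group("nccl")
    group_name = dist.group.WORLD.group_name  # assumed attribute

    # Allocate from the symmetric-memory allocator, then rendezvous so every
    # rank can directly map its peers' buffers.
    t = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
    symm_mem.rendezvous(t, group=group_name)

    t.fill_(rank)
    # Low-latency, single-kernel all-reduce over the symmetric buffers.
    out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
    print(f"rank {rank}: {out[:4]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched in the usual way, e.g. `torchrun --nproc-per-node 8 script.py`, each rank would see the sum of all ranks' buffers without going through a separate collective library call.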