Natalia Gimelshein and Ke Wen introduce Symmetric Memory in PyTorch, a new paradigm for programming distributed AI that treats a GPU cluster as one large GPU. They cover its history, the challenges it addresses, and the capabilities it unlocks, including low-latency communication, in-kernel initiation, and arbitrary communication patterns. They walk through examples of using Symmetric Memory in PyTorch, such as one-shot all-reduce, async tensor parallelism, and mixture-of-experts models, highlighting performance improvements and new possibilities. They also explain how to program Symmetric Memory with the Triton DSL, how to scale it out to multiple nodes with RDMA and IBGDA, and future plans, including support for new hardware, new operations, compiler integration with torch.compile, and fault tolerance.
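As a rough illustration of the kind of usage discussed, the sketch below allocates a tensor from PyTorch's symmetric-memory allocator, performs a rendezvous across ranks, and calls a one-shot all-reduce. The specific names (torch.distributed._symmetric_memory, symm_mem.empty, symm_mem.rendezvous, torch.ops.symm_mem.one_shot_all_reduce, and the group_name attribute) are assumptions based on recent PyTorch nightlies and may differ between versions; consult the PyTorch documentation for the exact API.

```python
# Hedged sketch of a symmetric-memory one-shot all-reduce in PyTorch.
# Module and op names are assumptions (recent nightlies); not verbatim from the talk.
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem


def main() -> None:
    rank = int(os.environ["RANK"])
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group("nccl")
    group_name = dist.group.WORLD.group_name  # assumed attribute

    # Allocate from the symmetric-memory allocator, then rendezvous so every
    # rank can directly map its peers' buffers.
    t = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
    symm_mem.rendezvous(t, group=group_name)

    t.fill_(rank)
    # Low-latency, single-kernel all-reduce over the symmetric buffers.
    out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
    print(f"rank {rank}: {out[:4]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched in the usual way, e.g. `torchrun --nproc-per-node 8 script.py`, each rank would see the sum of all ranks' buffers without going through a separate collective library call.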