PyTorch Symmetric Memory addresses communication bottlenecks in large language model (LLM) inference workloads by enabling direct memory access between processors. It lets developers program a GPU cluster as if it were a single GPU, reducing latency and increasing flexibility. Symmetric Memory provides pre-written synchronization primitives that simplify the development of communication kernels, and it unlocks low-latency kernels, in-kernel fusion of communication and computation, and arbitrary communication patterns not covered by existing libraries.

Use cases include optimized all-reduce operations, faster tensor parallelism through overlapping computation and communication, and token shuffling in MoE models by reading routing metadata directly on the device. Scaling to multi-node deployments is achieved with NVSHMEM, which originates messages directly on the GPU to saturate modern NICs. Future plans include support for heterogeneous hardware, new data types, novel algorithms, integration with torch.compile for auto-generated kernels, and improved error handling.
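To make the programming model concrete, below is a minimal sketch of a symmetric-memory all-reduce between ranks. It assumes the experimental torch.distributed._symmetric_memory module; the function and op names (symm_mem.empty, symm_mem.rendezvous, torch.ops.symm_mem.one_shot_all_reduce) and the group_name property reflect that experimental interface and may change across PyTorch versions.

```python
# Hedged sketch: symmetric-memory all-reduce using PyTorch's experimental API.
# Launch with: torchrun --nproc-per-node=<N> this_script.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem  # experimental module


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    group_name = dist.group.WORLD.group_name  # assumes ProcessGroup.group_name

    # Allocate from the symmetric heap so peer GPUs can map this buffer directly.
    t = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
    t.fill_(rank)

    # Rendezvous exchanges handles so every rank can address every peer's copy.
    symm_mem.rendezvous(t, group_name)

    # Low-latency one-shot all-reduce that reads peer buffers in a single kernel,
    # rather than staging data through a ring/tree collective.
    out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
    torch.cuda.synchronize()
    print(f"rank {rank}: reduced value = {out[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key difference from a conventional collective call is the explicit allocate-then-rendezvous step: once buffers live in symmetric memory, custom kernels can read and write peers' data directly, which is what enables the fused compute/communication and MoE token-shuffling patterns described above.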