PyTorch Symmetric Memory addresses communication bottlenecks in large language model (LLM) inference workloads by enabling direct memory access between processors. It lets developers program a GPU cluster as if it were a single GPU, reducing latency and increasing flexibility. Symmetric Memory provides pre-written synchronization primitives that simplify the development of communication kernels, and it unlocks low-latency kernels, in-kernel fusion of communication and computation, and arbitrary communication patterns not covered by existing libraries.

Use cases include optimized all-reduce operations, faster tensor parallelism through overlapping computation and communication, and token shuffling in MoE models by reading routing metadata directly on the device. Scaling to multi-node deployments is achieved with NVSHMEM, which originates messages directly on the GPU to saturate modern NICs. Future plans include support for heterogeneous hardware, new data types, novel algorithms, integration with torch.compile for auto-generated kernels, and improved error handling.
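To make the programming model concrete, below is a minimal sketch of a symmetric-memory all-reduce between ranks. It assumes the experimental torch.distributed._symmetric_memory module; the function and op names (symm_mem.empty, symm_mem.rendezvous, torch.ops.symm_mem.one_shot_all_reduce) and the group_name property reflect that experimental interface and may change across PyTorch versions.

```python
# Hedged sketch: symmetric-memory all-reduce using PyTorch's experimental API.
# Launch with: torchrun --nproc-per-node=<N> this_script.py
import os

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem  # experimental module


def main() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    group_name = dist.group.WORLD.group_name  # assumes ProcessGroup.group_name

    # Allocate from the symmetric heap so peer GPUs can map this buffer directly.
    t = symm_mem.empty(4096, dtype=torch.bfloat16, device="cuda")
    t.fill_(rank)

    # Rendezvous exchanges handles so every rank can address every peer's copy.
    symm_mem.rendezvous(t, group_name)

    # Low-latency one-shot all-reduce that reads peer buffers in a single kernel,
    # rather than staging data through a ring/tree collective.
    out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
    torch.cuda.synchronize()
    print(f"rank {rank}: reduced value = {out[0].item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key difference from a conventional collective call is the explicit allocate-then-rendezvous step: once buffers live in symmetric memory, custom kernels can read and write peers' data directly, which is what enables the fused compute/communication and MoE token-shuffling patterns described above.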