
The podcast details the optimization of an NVFP4 GEMM (General Matrix Multiply) kernel on Blackwell GPUs using the CuTe DSL, aiming to approach "speed of light" (SOL) performance for large-shape problems. The initial baseline implementation achieved only 16.81% of peak compute throughput, which prompted a series of optimizations: improving vectorization in the epilogue, introducing software pipelining, and applying warp-specialization techniques. A key enhancement combined 2-SM MMA instructions with TMA multicast to reduce shared memory usage and L2 traffic. The journey culminated in a persistent-kernel optimization and 256-bit vectorization, reaching roughly 94.3% of SOL on large-shape GEMMs. The speaker also briefly touches on applying these optimization concepts to the GPU MODE NVFP4 GEMM problems, particularly for latency-bound shapes.
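As a point of reference (not taken from the podcast itself), SOL percentages like those above are typically computed as achieved throughput divided by the hardware's peak. The sketch below illustrates the arithmetic; the `peak_tflops` value and the example shape and timing are placeholder assumptions, not official Blackwell figures or the speaker's measurements.

```python
# Minimal sketch: fraction of "speed of light" (SOL) achieved by a GEMM.
# All concrete numbers below are hypothetical placeholders.

def sol_percent(m: int, n: int, k: int, elapsed_s: float, peak_tflops: float) -> float:
    """Percent of peak throughput achieved by a GEMM of shape (m, n, k)."""
    flops = 2.0 * m * n * k                     # one multiply + one add per MAC
    achieved_tflops = flops / elapsed_s / 1e12  # measured throughput in TFLOPS
    return 100.0 * achieved_tflops / peak_tflops

# Example: an 8192^3 GEMM timed at 0.5 ms against an assumed 3000 TFLOPS peak.
print(f"{sol_percent(8192, 8192, 8192, 0.5e-3, 3000.0):.1f}% SOL")
```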