YouTube · 31 Jan 2026

CuTe DSL Tutorials on Optimizing NVFP4 GEMM for Blackwell

NVIDIA Developer

The podcast walks through optimizing NVFP4 GEMM (General Matrix Multiply) on Blackwell GPUs with the CuTe DSL, with the goal of approaching "speed of light" (SOL) performance on large problem shapes. The baseline implementation reached only 16.81% compute throughput, which motivated a series of optimizations: better vectorization in the epilogue, software pipelining, and warp specialization. A key enhancement was using two-SM (2SM) instructions together with TMA multicast to reduce shared memory usage and L2 traffic. The final steps, persistent-thread scheduling and 256-bit vectorization, brought the kernel to around 94.3% of SOL on large-shape GEMM. The speaker also briefly discusses applying the same ideas to the GPU Mode NVFP4 GEMM problems, particularly for latency-bound shapes.
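To make the persistent-thread idea concrete, here is a minimal sketch in plain Python rather than the CuTe DSL. It assumes a static, strided assignment of output tiles to a fixed pool of workers (one worker standing in for one resident CTA or CTA pair). The function name `persistent_tile_schedule` and its parameters are hypothetical and are not taken from the episode or from any NVIDIA API; the scheduler in the actual kernel may order or group tiles differently.

```python
def persistent_tile_schedule(m, n, tile_m, tile_n, num_workers):
    """Assign output tiles of an (m x n) GEMM to a fixed pool of workers.

    Hypothetical illustration: each worker starts at the tile whose linear
    index equals its own id, then strides by num_workers until no tiles
    remain. This mimics a persistent kernel, where a fixed set of CTAs
    loops over tiles instead of one CTA being launched per tile.
    """
    # Ceiling division: partial tiles at the edges still need a worker.
    tiles_m = -(-m // tile_m)
    tiles_n = -(-n // tile_n)
    total_tiles = tiles_m * tiles_n

    schedule = {worker: [] for worker in range(num_workers)}
    for worker in range(num_workers):
        tile = worker
        while tile < total_tiles:
            # Map the linear tile index to (row, column) tile coordinates.
            schedule[worker].append((tile // tiles_n, tile % tiles_n))
            tile += num_workers
    return schedule


if __name__ == "__main__":
    # A 512 x 512 output with 128 x 128 tiles gives 16 tiles over 4 workers.
    for worker, tiles in persistent_tile_schedule(512, 512, 128, 128, 4).items():
        print(worker, tiles)
```

Because each worker stays resident and walks its own tile list, per-tile launch and setup costs are paid once per worker, and the epilogue of one tile can in principle overlap with the main loop of the next.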

Outline

Part 1: Architecture, Basics

Part 2: Baseline, Pipeline, Latency Hiding

Part 3: Advanced Hardware Features

Part 4: Final Optimizations, Results
