
The podcast details the optimization of an NVFP4 GEMM (General Matrix Multiply) kernel on Blackwell GPUs using the CuTe DSL, aiming to approach "speed of light" (SOL) performance for large-shape problems. The initial baseline implementation achieved only 16.81% of peak compute throughput, which prompted a series of optimizations: improving vectorization in the epilogue, introducing software pipelining, and applying warp-specialization techniques. A key enhancement combined 2-SM MMA instructions with TMA multicast to reduce shared memory usage and L2 traffic. The journey culminated in a persistent-kernel optimization and 256-bit vectorization, reaching roughly 94.3% of SOL on large-shape GEMMs. The speaker also briefly touches on applying these optimization concepts to the GPU MODE NVFP4 GEMM problems, particularly for latency-bound shapes.
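As a point of reference (not taken from the podcast itself), SOL percentages like those above are typically computed as achieved throughput divided by the hardware's peak. The sketch below illustrates the arithmetic; the `peak_tflops` value and the example shape and timing are placeholder assumptions, not official Blackwell figures or the speaker's measurements.

```python
# Minimal sketch: fraction of "speed of light" (SOL) achieved by a GEMM.
# All concrete numbers below are hypothetical placeholders.

def sol_percent(m: int, n: int, k: int, elapsed_s: float, peak_tflops: float) -> float:
    """Percent of peak throughput achieved by a GEMM of shape (m, n, k)."""
    flops = 2.0 * m * n * k                     # one multiply + one add per MAC
    achieved_tflops = flops / elapsed_s / 1e12  # measured throughput in TFLOPS
    return 100.0 * achieved_tflops / peak_tflops

# Example: an 8192^3 GEMM timed at 0.5 ms against an assumed 3000 TFLOPS peak.
print(f"{sol_percent(8192, 8192, 8192, 0.5e-3, 3000.0):.1f}% SOL")
```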