YouTube · 01 May 2025
1h 14m

Stanford CS336 I Language Modeling from Scratch | Spring 2025 | Lecture 5: GPUs

Stanford Online

This lecture explains why GPUs matter for language models and aims to demystify CUDA and GPU performance. It covers why GPUs get slow, how to build fast algorithms such as Flash Attention, and the key components needed for acceleration. It contrasts CPUs with GPUs, walks through GPU anatomy (streaming multiprocessors, or SMs, and the streaming processors, or SPs, inside them), and stresses the critical role of memory in GPU performance. It also touches on TPUs and compute scaling, then surveys optimization techniques: lower precision, operator fusion, recomputation, memory coalescing, and tiling. The lecture culminates in the performance characteristics of matrix multiplication on GPUs and how Flash Attention uses these optimizations to speed up transformers.
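Of the techniques listed, tiling is perhaps the easiest to illustrate in a few lines. The sketch below is not taken from the lecture; it is a plain-Python toy, and the function names, the `tile` parameter, and the `loads` counter are all hypothetical. It treats every element read from the input matrices as one "slow memory" access and compares a naive matrix multiply against a tiled one that copies each tile once (standing in for on-chip shared memory) and reuses it.

```python
def naive_matmul(A, B):
    """Multiply A (n x k) by B (k x m), counting every input element read."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    loads = 0  # hypothetical counter for reads from "slow" memory
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
                loads += 2  # one element of A and one of B per multiply
            C[i][j] = acc
    return C, loads


def tiled_matmul(A, B, tile):
    """Same product, but inputs are read one tile at a time and reused."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Copy one tile of A and one of B; this models the single
                # trip to slow memory, after which the tile is reused.
                a_tile = [row[p0:p0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[p0:p0 + tile]]
                loads += sum(len(r) for r in a_tile)
                loads += sum(len(r) for r in b_tile)
                # All arithmetic below touches only the copied tiles.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for p, a in enumerate(a_row):
                            acc += a * b_tile[p][j]
                        C[i0 + i][j0 + j] += acc
    return C, loads


if __name__ == "__main__":
    size, tile = 8, 4
    A = [[float(i + j) for j in range(size)] for i in range(size)]
    B = [[float(i * j % 5) for j in range(size)] for i in range(size)]
    C_naive, naive_loads = naive_matmul(A, B)
    C_tiled, tiled_loads = tiled_matmul(A, B, tile)
    print(C_naive == C_tiled, naive_loads, tiled_loads)
```

Under this simplified cost model, square matrices with a tile size that divides the dimension need roughly `tile` times fewer slow-memory reads in the tiled version. Real GPU kernels add further concerns, such as coalesced access and shared-memory limits, that this toy does not model.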