Stanford CS336 | Language Modeling from Scratch | Spring 2025 | Lecture 5: GPUs | Stanford Online

This lecture covers GPUs and why they matter for language models, with the aim of demystifying CUDA and GPU performance. It explains why GPUs can be slow on some workloads, how fast algorithms such as Flash Attention are built, and which components matter for acceleration. It contrasts CPUs with GPUs, walks through GPU anatomy (streaming multiprocessors, or SMs, and the streaming processors, or SPs, inside them), and stresses the critical role of memory in GPU performance. It also touches on TPUs and compute scaling, then surveys optimization techniques: lower precision, operator fusion, recomputation, memory coalescing, and tiling. The lecture culminates in the performance characteristics of matrix multiplication on GPUs and how Flash Attention combines these optimizations to speed up transformers.

Outlines

00:04 Introduction to GPUs and Assignment Details

07:07 GPU Architecture and Memory Model

16:22 TPUs and the Importance of Matrix Multiplies

25:02 GPU Performance Optimization: Roofline Model and Precision

37:33 GPU Performance Optimization: Recomputation and Memory Coalescing

47:13 GPU Performance Optimization: Tiling and Alignment

1:03:15 Flash Attention: Putting It All Together