Lecture 10 focuses on inference in the context of language models, distinguishing it from training by highlighting its memory-bound nature and the challenges of generating tokens sequentially. The lecture covers key inference metrics such as time to first token, latency, and throughput, and explains how they are affected by factors such as batch size and model architecture. It also explores techniques for improving inference efficiency, including reducing KV cache size through methods like Grouped-Query Attention, Multi-Head Latent Attention, Cross-Layer Attention, and local attention. The lecture further discusses more radical approaches like state-space models and diffusion models, as well as practical methods such as quantization, model pruning, and speculative decoding, and concludes with systems-level optimizations for handling dynamic, real-world inference workloads.
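To make the KV cache discussion concrete, here is a minimal back-of-the-envelope sketch (not from the lecture itself) of how cache memory scales with model shape, and why Grouped-Query Attention shrinks it: the cache grows with the number of key/value heads, so sharing KV heads across query heads reduces it proportionally. The specific configuration values below are illustrative assumptions, not figures from the lecture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both K and V;
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Illustrative 7B-class configuration (assumed): 32 layers, head_dim 128,
# 4096-token context, fp16 cache.
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA KV cache: {full_mha / 2**30:.2f} GiB per sequence")   # 2.00 GiB
print(f"GQA KV cache: {gqa / 2**30:.2f} GiB per sequence")        # 0.50 GiB (32 -> 8 KV heads)
```

Under these assumed numbers, cutting the KV heads from 32 to 8 cuts the per-sequence cache by 4x, which directly raises the batch size (and therefore throughput) that fits in a fixed memory budget.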