Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 10: Inference | Stanford Online

Lecture 10 focuses on inference for language models. It distinguishes inference from training by its memory-limited nature and by the need to generate tokens sequentially. The lecture covers the key inference metrics (time to first token, latency, and throughput) and explains how they depend on factors such as batch size and model architecture. It then explores techniques for improving inference efficiency, including reducing the size of the KV cache with Grouped-Query Attention, Multi-head Latent Attention, Cross-Layer Attention, and local attention. It also discusses more radical approaches such as state-space models and diffusion models, along with practical methods such as quantization, model pruning, and speculative decoding. It concludes with systems-level optimizations for handling dynamic, real-world inference traffic.
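
To make the KV cache point concrete, here is a minimal sketch of how cache size scales with the number of key/value heads, which is the quantity Grouped-Query Attention reduces. The function name `kv_cache_bytes` and the model shape below are hypothetical, chosen only for illustration; they are not taken from the lecture.

```python
def kv_cache_bytes(batch_size, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, sequence, and position."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * batch_size * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape (assumption, for illustration only); 2 bytes per element.
config = dict(batch_size=8, seq_len=4096, n_layers=32, head_dim=128)

mha = kv_cache_bytes(n_kv_heads=32, **config)  # multi-head: one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **config)   # grouped-query: 4 query heads share a KV head

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha // gqa}x")
# prints "MHA: 16.0 GiB, GQA: 4.0 GiB, ratio: 4x"
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, which frees memory for larger batches.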

Outlines

Part 1: Introduction and Workload Analysis

Part 2: Lossless and Architectural Methods

Part 3: Shortcut Techniques and Practical Considerations

Part 1: Introduction and Workload Analysis

00:04 Introduction to Inference in Language Models

05:45 Understanding Inference Workload: Transformer Math and Arithmetic Intensity

13:05 Arithmetic Intensity of Inference: Pre-fill vs. Generation

27:30 Throughput and Latency in Inference: A Theoretical Analysis

Part 2: Lossless and Architectural Methods

37:45 Lossless and Lossy Methods for Faster Inference: Reducing the KV Cache

51:15 Radical Architectural Changes for Faster Inference: State-Space Models and Diffusion Models

Part 3: Shortcut Techniques and Practical Considerations

1:04:53 Taking Shortcuts: Quantization, Model Pruning, and Speculative Decoding

1:17:30 Handling Dynamic Traffic and Memory Fragmentation in Inference