YouTube, 19 May 2025
1h 22m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 10: Inference

Stanford Online

Lecture 10 covers inference for language models. It contrasts inference with training, emphasizing that inference is memory-limited and that tokens must be generated sequentially. The lecture introduces the key inference metrics (time to first token, latency, and throughput) and explains how batch size and model architecture affect them. It then surveys techniques for improving inference efficiency, starting with methods that shrink the KV cache: Grouped-Query Attention, Multi-head Latent Attention, Cross-Layer Attention, and local attention. It goes on to discuss more radical alternatives such as state-space models and diffusion models, along with practical methods such as quantization, model pruning, and speculative decoding. It concludes with systems-level optimizations for handling dynamic, real-world inference workloads.
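
To make the KV cache point concrete, here is a minimal sketch of the standard cache-size arithmetic, comparing full multi-head attention against a grouped-query variant with fewer key/value heads. The function name and the model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token sequence, 2 bytes per element) are hypothetical values chosen for illustration and are not taken from the lecture.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    Each layer stores one key vector and one value vector (hence the factor
    of 2) per KV head, per token, per sequence in the batch.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration (assumed for illustration only).
layers, query_heads, head_dim, seq_len, batch = 32, 32, 128, 4096, 1

# Multi-head attention: one KV head per query head.
mha_bytes = kv_cache_bytes(layers, query_heads, head_dim, seq_len, batch)

# Grouped-query attention: here 8 KV heads are shared across the 32 query heads.
gqa_bytes = kv_cache_bytes(layers, 8, head_dim, seq_len, batch)

print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB")
print(f"GQA cache: {gqa_bytes / 2**30:.2f} GiB")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, and it grows linearly with batch size and sequence length, which is why batch size appears among the factors that affect the inference metrics.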

Outline

Part 1: Introduction and Workload Analysis

Part 2: Lossless and Architectural Methods

Part 3: Shortcut Techniques and Practical Considerations
