YouTube · 17 Oct 2025
1h 48m

Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 3 - Transformers & Large Language Models


Stanford Online

The lecture introduces large language models (LLMs): language models that predict sequences of tokens and are distinguished by their scale in parameter count, training data, and compute. It contrasts LLMs with earlier models such as BERT, highlighting the decoder-only architecture, and introduces Mixture of Experts (MOE), which reduces computational cost by activating only a subset of the model's parameters for each token. The discussion covers dense versus sparse MOEs and techniques for preventing routing collapse during training. It also explores response generation, contrasting greedy decoding and beam search with sampling methods such as top-k and top-p sampling, along with the effect of temperature on output diversity. The lecture concludes with strategies for making LLM inference more efficient, including KV caching, grouped-query attention, and speculative decoding.
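The decoding strategies mentioned in the summary can be illustrated with a minimal, standard-library-only Python sketch. It is not taken from the lecture: the function names (`softmax`, `greedy`, `filter_top_k_top_p`, `sample_next_token`) are hypothetical, the logits are made up, and applying top-k first and then top-p to the renormalized survivors is one common convention assumed here. Beam search is omitted.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical helpers for illustration; the top-k -> top-p order is an assumption.
Candidates = List[Tuple[int, float]]  # (token id, probability) pairs


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Turn logits into probabilities. T < 1 sharpens the distribution, T > 1 flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (T -> 0 approaches greedy decoding)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits: Sequence[float]) -> int:
    """Greedy decoding: always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def _renormalize(cands: Candidates) -> Candidates:
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(p for _, p in cands)
    return [(t, p / total) for t, p in cands]


def filter_top_k_top_p(
    probs: Sequence[float], top_k: Optional[int] = None, top_p: Optional[float] = None
) -> Candidates:
    """Keep the tokens allowed by top-k and/or top-p, sorted by probability, renormalized."""
    ranked = sorted(enumerate(probs), key=lambda tp: tp[1], reverse=True)
    if top_k is not None:
        ranked = _renormalize(ranked[:top_k])  # keep only the k most likely tokens
    if top_p is not None:
        kept: Candidates = []
        cumulative = 0.0
        for token, p in ranked:  # smallest prefix whose total mass reaches top_p
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _renormalize(kept)
    return ranked


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Temperature scaling, then top-k/top-p filtering, then a weighted random draw."""
    rng = rng or random.Random()
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    tokens = [t for t, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores over a 4-token vocabulary
    print(greedy(logits))  # 0
    print([t for t, _ in filter_top_k_top_p(softmax(logits), top_p=0.8)])  # [0, 1]
    print(sample_next_token(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

In this sketch, temperature divides the logits before the softmax, so it only changes how peaked the distribution is, while top-k and top-p decide which tokens remain eligible at all. Setting `top_k=1` reduces sampling to greedy decoding.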

Outline

Part 1: Introduction, LLM Fundamentals

Part 2: Mixture of Experts (MOE)

Part 3: Decoding, Sampling Strategies

Part 4: Prompt Engineering, In-Context Learning

Part 5: Efficient Inference, Optimization
