YouTube · 14 Nov 2025
1h 47m

Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning

Stanford Online

The lecture explores reasoning in large language models (LLMs), contrasting "vanilla" LLMs with reasoning models designed to solve complex problems through multi-step processes. It highlights weaknesses of vanilla LLMs, including limited reasoning, static knowledge, and difficulty of evaluation. Chain of Thought is introduced as a technique that improves reasoning by having the LLM decompose a problem into tractable parts. The lecture also covers benchmarks for assessing coding and math abilities, such as HumanEval, Codeforces, and AIME, and introduces the pass@k metric, which estimates the probability that at least one of k attempts succeeds, along with the effect of sampling temperature on solution diversity. GRPO (Group Relative Policy Optimization) is presented as an RL algorithm for training reasoning models that, unlike PPO, maximizes advantages without a learned value function. The lecture concludes with the training pipeline of DeepSeek's R1 model and knowledge distillation techniques for smaller models.
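The two quantitative ideas in the summary, pass@k and GRPO's group-relative advantage, can be made concrete with a short sketch. It is not taken from the lecture. It assumes the commonly used unbiased pass@k estimator (n samples per problem, c of them correct) and the usual GRPO formulation in which each sampled answer's reward is normalized by the mean and standard deviation of its group. The function names and the `eps` parameter are illustrative choices.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, of which c are correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that k samples drawn without replacement are all wrong.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group (assumed GRPO-style).

    The group mean acts as the baseline, so no value function is needed.
    eps (an illustrative choice) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # 10 samples for one problem, 3 correct.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
    # Four sampled answers to the same prompt, rewarded 1 if correct else 0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

With k = 1 the estimator reduces to the fraction of correct samples, and larger k can only raise the estimate. Higher temperature tends to make samples more diverse, which is why the lecture discusses it alongside this metric.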

Outline

Part 1: Context, Reasoning Basics

Part 2: Benchmarks, Metrics

Part 3: Training, RL Algorithms

Part 4: Case Study, Distillation
