Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 6 - LLM Reasoning | Stanford Online

The lecture explores Large Language Model (LLM) reasoning, contrasting "vanilla LLMs" with reasoning models designed to solve complex problems through multi-step processes. It highlights weaknesses of vanilla LLMs, including limited reasoning, static knowledge, and difficulty of evaluation. Chain of Thought is introduced as a technique that improves reasoning by having the LLM decompose a problem into tractable parts. The lecture also covers benchmarks for assessing coding and math abilities, such as HumanEval, Codeforces, and AIME, and introduces the pass@k ("pass at k") metric, which estimates the probability that at least one of k attempts succeeds, while also considering how sampling temperature affects solution diversity. GRPO (Group Relative Policy Optimization) is presented as an RL algorithm for training reasoning models; unlike PPO, it computes advantages relative to a group of sampled outputs rather than from a learned value function. The lecture concludes with DeepSeek's R1 training pipeline and knowledge distillation techniques for smaller models.
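To make the pass@k metric mentioned above concrete, here is a minimal sketch of one widely used estimator, the unbiased form popularized alongside HumanEval. The summary does not reproduce the lecture's exact formulation, so the function name `pass_at_k` and its arguments (`n` samples drawn, `c` of them correct) are assumptions chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Hypothetical helper for illustration: it returns the probability that
    a random subset of k samples (out of the n drawn) contains at least
    one correct solution.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets with no correct sample.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 3 correct samples out of 10, a single attempt succeeds 30% of
    # the time, and the chance of at least one success grows with k.
    for k in (1, 5, 10):
        print(k, round(pass_at_k(10, 3, k), 4))
```

Drawing more samples than k and averaging over subsets in this way gives a lower-variance estimate than running exactly k attempts once per problem.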

Outlines

Part 1: Context, Reasoning Basics

Part 2: Benchmarks, Metrics

Part 3: Training, RL Algorithms

Part 4: Case Study, Distillation

Part 1: Context, Reasoning Basics

00:05 Review of LLM Training: Pre-training, Fine-tuning, and Preference Tuning

08:51 Limitations of Vanilla LLMs and the Rise of Reasoning Models

Part 2: Benchmarks, Metrics

27:56 Benchmarks for Quantifying Reasoning Abilities: Coding and Math

34:07 Estimating Pass at K and the Impact of Temperature on Solution Diversity

Part 3: Training, RL Algorithms

47:33 Training Reasoning Models: The Role of RL and Incentivizing Chain of Thought

58:38 GRPO: Group Relative Policy Optimization for Reasoning Tasks

1:11:36 Extensions of GRPO: Optimizing Output Length and Token-Level Contributions

Part 4: Case Study, Distillation

1:29:40 DeepSeek's R1: A Full Pipeline for Reasoning Models and Knowledge Distillation
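As a companion to the GRPO chapter in Part 3, the sketch below illustrates the group-relative idea: several outputs are sampled for the same prompt, and each output's advantage is its reward standardized against the group, so no learned value function is needed. This is only the advantage computation, not the full GRPO objective (clipping and the KL term are omitted). The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration, not details taken from the lecture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per sampled
    output for a single prompt. `eps` (an assumption) avoids division by
    zero when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one math problem, rewarded 1 if correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs that beat the group average receive positive advantages and the rest negative ones; when all rewards in a group are identical, every advantage is zero and that prompt contributes no learning signal.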