YouTube | 08 Dec 2025
1h 10m

Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 10: RL for LLM Reasoning

Stanford Online

This lecture discusses reinforcement learning (RL) techniques for improving reasoning in large language models (LLMs), with a focus on solving math problems. It contrasts conventional next-token prediction with RL-based approaches and highlights two limitations of relying solely on supervised training: scarce data and spurious steps in generated solutions. The lecture covers classical RL methods, including imitation learning, offline RL (such as DPO), and online RL, and emphasizes the importance of assigning credit to individual steps in a reasoning process. It also examines modern extensions for training "thinking models", noting that the base model's ability to carry out meta-procedures such as answer checking and verification is significant for the quality and efficiency of reasoning in LLMs.

Outlines

Part 1: Context, Limitations of SFT

Part 2: Classical RL, Data Scaling, and RFT

Part 3: Spurious Steps, Value Functions, and Offline RL

Part 4: Online RL and Modern Thinking Models
