Stanford CS224R Deep Reinforcement Learning | Spring 2025 | Lecture 10: RL for LLM Reasoning | Stanford Online

This lecture discusses reinforcement learning (RL) techniques for improving reasoning in large language models (LLMs), particularly for solving math problems. It contrasts conventional next-token prediction with RL-based approaches and highlights two limitations of relying solely on supervised training: data scarcity and the presence of spurious steps in generated solutions. It covers classical RL methods, including imitation learning, offline RL (such as DPO), and online RL, and emphasizes the importance of assigning credit to individual steps in a reasoning process. It also examines modern extensions for training "thinking models", noting the significance of the base model's ability to implement meta-procedures such as answer checking and verification, which in turn improve the quality and efficiency of reasoning in LLMs.

Outline

Part 1: Context, Limitations of SFT

Part 2: Classical RL, Data Scaling, and RFT

Part 3: Spurious Steps, Value Functions, and Offline RL

Part 4: Online RL and Modern Thinking Models

Part 1: Context, Limitations of SFT

00:05 Introduction to Reasoning in Language Models and the Limitations of Next Token Prediction

04:04 The Failure of Supervised Training in Complex Reasoning and the Role of Reinforcement Learning

Part 2: Classical RL, Data Scaling, and RFT

06:10 Classical RL Methods for Reasoning

12:23 Data Scaling Analysis and Rejection Fine-Tuning

19:49 Empirical Results of SFT and RFT and the Issue of Overgeneralization

Part 3: Spurious Steps, Value Functions, and Offline RL

27:38 Spurious Steps and the Problem of Causal Confusion

35:22 Addressing Spurious Steps with Value Functions and Advantage Functions

42:30 Training with Advantages and Filtering

50:48 Offline Reinforcement Learning with DPO

Part 4: Online RL and Modern Thinking Models

58:35 Empirical Results of Offline RL and Introduction to Online Reinforcement Learning

1:07:33 Modern Thinking Models and the Role of Action Space