YouTube | 08 Jul 2025
1h 16m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 17: Alignment - RL 2

Stanford Online

The lecture covers reinforcement learning (RL) for language models, focusing on policy gradient methods such as GRPO. It explains how states, actions, and rewards are defined in this setting, with an emphasis on verifiable outcome rewards. It then covers the policy gradient theorem, the naive policy gradient, and the high noise and variance of its gradient estimates. Baselines are introduced as a way to reduce that variance, with worked examples and a discussion of the optimal baseline and its connection to advantage functions. Finally, the lecture uses a simple number-sorting task to demonstrate an implementation of GRPO, with code snippets for the model definition, reward functions, and loss computation. It highlights the importance of freezing parameters and compares different strategies for computing deltas and KL penalties.
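To make the baseline idea concrete, here is a minimal sketch, not the lecture's own code. It assumes a hypothetical verifiable reward for the sorting task (the fraction of positions that match the sorted prompt) and a hypothetical helper, `group_deltas`, that subtracts the group-mean reward as a baseline and optionally divides by the group standard deviation, in the spirit of GRPO. The function names, the reward definition, and the `eps` constant are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def sorting_reward(prompt: list[int], response: list[int]) -> float:
    """Hypothetical verifiable outcome reward: the fraction of positions
    where the response matches the sorted prompt."""
    target = sorted(prompt)
    if not target:
        return 0.0
    # zip truncates to the shorter list, so missing positions earn nothing.
    matches = sum(1 for r, t in zip(response, target) if r == t)
    return matches / len(target)


def group_deltas(rewards: list[float], mode: str = "centered",
                 eps: float = 1e-5) -> list[float]:
    """Turn the rewards of several responses to one prompt into deltas.

    "centered":   subtract the group mean (a simple baseline).
    "normalized": additionally divide by the group standard deviation.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if mode == "centered":
        return centered
    if mode == "normalized":
        std = pstdev(rewards)
        return [c / (std + eps) for c in centered]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    prompt = [3, 1, 2]
    responses = [[1, 2, 3], [3, 1, 2], [1, 3, 2]]  # a group of sampled responses
    rewards = [sorting_reward(prompt, r) for r in responses]
    print("rewards:   ", rewards)
    print("centered:  ", group_deltas(rewards, "centered"))
    print("normalized:", group_deltas(rewards, "normalized"))
```

With centered deltas, a group in which every response earns the same reward produces all-zero deltas, so that prompt contributes no gradient signal in this sketch.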

Outlines

Part 1: RL for Language Models

Part 2: GRPO and Sorting Task

Part 3: Algorithm and Results
