The lecture covers reinforcement learning (RL) for language models, focusing on policy gradient methods such as GRPO. It explains how states, actions, and rewards are defined in this setting, with an emphasis on verifiable outcome rewards. The discussion works through the policy gradient theorem, the naive policy gradient, and the high noise and variance it suffers from. Baselines are then introduced as a way to reduce variance, illustrated with examples, along with the choice of optimal baseline and its connection to advantage functions. Finally, a simple number-sorting task is used to demonstrate a GRPO implementation, with code snippets for the model definition, reward functions, and loss computation, and notes on the importance of freezing parameters and on different strategies for computing deltas and KL penalties.