Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL | Stanford Online

The lecture focuses on reinforcement learning (RL) techniques for language models, contrasting RL from Human Feedback (RLHF) with RL from verifiable rewards. It begins by recapping Direct Preference Optimization (DPO) and its variants, highlighting over-optimization and calibration problems in RLHF. The lecture then moves to RL from verifiable rewards, covering Proximal Policy Optimization (PPO) and a simplified variant called Group Relative Policy Optimization (GRPO). GRPO's algorithm, implementation, and advantages are explained, including a discussion of baseline adjustments and length normalization. Finally, case studies of three Chinese open language models (DeepSeek R1, Kimi K1.5, and Qwen 3) are presented, detailing their methodologies, data curation, and insights into building reasoning models with RL.
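To make the "group relative" baseline mentioned in the summary concrete, here is a minimal sketch of how per-response advantages could be computed within one group of sampled responses to the same prompt. It is an illustration, not the lecture's code: the function name `group_relative_advantages`, the `eps` constant, the `normalize_std` flag, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8, normalize_std=True):
    """Center each reward on its group mean, optionally scaling by the group std.

    `rewards` holds one scalar reward per sampled response to a single prompt.
    The names, eps value, and population-std choice are illustrative assumptions.
    """
    baseline = mean(rewards)  # the group mean acts as the baseline
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        # Variant that keeps only the mean baseline (no std scaling).
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


# Four sampled answers to one prompt: two correct (reward 1), two wrong (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat the group average receive a positive advantage and the rest a negative one, so no separate learned value function is needed. Passing `normalize_std=False` keeps only the mean baseline, which is one way to experiment with the baseline adjustments the summary refers to.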

Outlines

Part 1: RLHF Recap & Introduction to Verifiable Rewards

Part 2: Algorithms: PPO and GRPO

Part 3: Case Studies: R1, Kimi K1.5, Qwen 3

Part 1: RLHF Recap & Introduction to Verifiable Rewards

00:05 Introduction to Reinforcement Learning and Finishing RLHF

09:06 Transition to Reinforcement Learning from Verifiable Rewards

Part 2: Algorithms: PPO and GRPO

14:31 Deep Dive into Proximal Policy Optimization (PPO)

27:33 Introducing Group Relative Policy Optimization (GRPO)

35:57 Analyzing GRPO and Transitioning to R1

Part 3: Case Studies: R1, Kimi K1.5, Qwen 3

45:08 Deep Dive into R1: Methodology and Results

57:14 R1's Findings and Introduction to Kimi K1.5

1:05:08 Kimi K1.5's RL Algorithm and Infrastructure

1:13:11 Qwen 3 and Conclusion