YouTube, 01 Jul 2025
1h 20m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL

Stanford Online

The lecture covers reinforcement learning (RL) techniques for language models, contrasting RL from Human Feedback (RLHF) with RL from verifiable rewards. It begins by recapping Direct Preference Optimization (DPO) and its variants, highlighting the over-optimization and calibration problems that arise in RLHF. It then moves to RL from verifiable rewards, covering Proximal Policy Optimization (PPO) and a simplified variant, Group Relative Policy Optimization (GRPO). GRPO's algorithm, implementation, and advantages are explained, including a discussion of baseline adjustments and length normalization. Finally, the lecture presents case studies of three Chinese open language models (DeepSeek R1, Kimi K1.5, and Qwen 3), detailing their methodologies, data curation, and insights into building reasoning models with RL.
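As a rough illustration of the group-relative baseline that gives GRPO its name, the sketch below computes advantages for a group of responses sampled for the same prompt: each reward is centered on the group mean and, optionally, divided by the group standard deviation. This follows the commonly published formulation rather than anything stated in this summary. The function name, the `normalize_std` flag, and the use of the population standard deviation are assumptions made for illustration; some variants omit the division entirely.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8, normalize_std=True):
    """Compute per-response advantages within one group of sampled responses.

    rewards: list of scalar rewards, one per response to the same prompt.
    eps: small constant to avoid division by zero when all rewards are equal.
    normalize_std: hypothetical flag; if False, only the mean baseline is subtracted.
    """
    baseline = mean(rewards)  # group mean serves as the baseline
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered
    # Assumption: population standard deviation; implementations differ here.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


# Example: four responses to one prompt, scored by a 0/1 verifiable reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, so no separate learned value function is needed to form the baseline.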

Outlines

Part 1: RLHF Recap & Introduction to Verifiable Rewards

Part 2: Algorithms: PPO and GRPO

Part 3: Case Studies: R1, Kimi K1.5, Qwen 3
