Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 16: Alignment - RL
Stanford Online
The lecture covers reinforcement learning (RL) techniques for language models, contrasting RL from Human Feedback (RLHF) with RL from verifiable rewards. It begins by recapping Direct Preference Optimization (DPO) and its variants, highlighting over-optimization and calibration problems in RLHF. It then turns to RL from verifiable rewards, covering Proximal Policy Optimization (PPO) and a simplified variant, Group Relative Policy Optimization (GRPO). GRPO's algorithm, implementation, and advantages are explained, including a discussion of baseline adjustments and length normalization. Finally, case studies of three Chinese open language models (DeepSeek R1, Kimi K1.5, and Qwen3) are presented, detailing their methodologies, data curation, and insights into building reasoning models with RL.
Part 1: RLHF Recap & Introduction to Verifiable Rewards
Part 2: Algorithms: PPO and GRPO
Part 3: Case Studies: R1, Kimi K1.5, Qwen3
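
As a rough illustration of the group-relative baseline that GRPO is known for, the sketch below scores each sampled response against the other responses drawn for the same prompt: it subtracts the group's mean reward and divides by the group's standard deviation. The function name, the example rewards, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration, not details taken from the lecture.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per response in a group sampled for one prompt.

    Hypothetical helper: the group mean serves as the baseline, and the
    group standard deviation rescales the result. eps avoids division by
    zero when every response in the group receives the same reward.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - baseline) / (scale + eps) for r in rewards]


# Four sampled answers to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages. A group in which every answer earns the same reward yields all-zero advantages and so contributes no learning signal.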