The lecture focuses on reinforcement learning (RL) techniques for language models, specifically contrasting RL from Human Feedback (RLHF) with RL from verifiable rewards. It begins by recapping Direct Preference Optimization (DPO) and its variants, highlighting the challenges of over-optimization and calibration in RLHF. The lecture then transitions to RL from verifiable rewards, emphasizing Proximal Policy Optimization (PPO) and a simplified variant, Group Relative Policy Optimization (GRPO). GRPO's algorithm, implementation, and advantages are explained, including a discussion of baseline adjustments and length normalization. Finally, case studies of three Chinese open language models, DeepSeek R1, Kimi k1.5, and Qwen 3, are presented, detailing their methodologies, data curation, and insights into building reasoning models with RL.
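As a rough illustration of the group-relative baseline the lecture attributes to GRPO, the sketch below computes advantages for a group of sampled completions by subtracting the group-mean reward and optionally dividing by the group standard deviation; the function name, the normalization flag, and the example rewards are illustrative assumptions, not taken from the lecture itself.

```python
import numpy as np

def grpo_advantages(rewards, normalize_std=True):
    """Group-relative advantages for one prompt's sampled completions.

    The group mean reward serves as the baseline (no learned value/critic),
    and dividing by the group standard deviation rescales the advantages.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    advantages = rewards - rewards.mean()          # subtract group-mean baseline
    if normalize_std:
        advantages = advantages / (rewards.std() + 1e-8)  # scale by group std
    return advantages

# Example: 4 completions scored by a verifiable reward (1.0 = correct, 0.0 = incorrect).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```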