Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 15 - After DPO by Nathan Lambert
Stanford Online
This lecture traces the evolution and likely future of reinforcement learning from human feedback (RLHF) in large language model (LLM) development. As LLMs have advanced rapidly, post-training fine-tuning, and RLHF in particular, has grown in importance relative to pre-training alone. Nathan Lambert introduces Direct Preference Optimization (DPO), presented as a simpler and more scalable alternative to RLHF pipelines built on reinforcement learning algorithms such as Proximal Policy Optimization (PPO), and details its impact on the field. He recounts the story of the Zephyr model, a pivotal moment when DPO gained widespread adoption. The lecture then turns to the difficulty of evaluating RLHF methods, which motivates RewardBench, an evaluation tool for reward models. It closes by contrasting how academia and industry approach RLHF, emphasizing the need for more robust datasets and for exploring online RLHF methods to further improve LLM performance and alignment.
Part 1: Introduction and Context
Part 2: DPO and Model Training
Part 3: Reward Model Evaluation
Part 4: DPO Analysis and Comparisons
Part 5: Q&A and Future Outlook