Archit Sharma discusses preference optimization in large language models, covering instruction fine-tuning, Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO). The lecture addresses how to turn pre-trained language models into helpful assistants by aligning them with human intent. Instruction fine-tuning teaches formatting and style but is limited by the cost of collecting demonstration data and by the lack of a single correct answer for creative tasks. RLHF trains a reward model from human preference data and then maximizes expected reward with policy gradient methods, while DPO simplifies the pipeline by optimizing the language model's parameters directly on preference data, framing alignment as a binary classification problem. The discussion also covers limitations of the current paradigm, including reward hacking and the difficulty of collecting representative preference data, and touches on future directions such as personalized language models and verifiable rewards.
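To make the "binary classification" framing of DPO concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes each input is a batch of summed per-sequence log-probabilities from the policy and a frozen reference model for the chosen and rejected completions; the function name, argument names, and the default `beta` are illustrative choices, not taken from the lecture.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Sketch of the Direct Preference Optimization loss.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a completion under the policy or the frozen
    reference model. `beta` scales the implicit KL penalty toward
    the reference model.
    """
    # Implicit rewards: beta-scaled log-probability ratios against the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification over preference pairs: push the chosen completion's
    # implicit reward above the rejected one's via a logistic (sigmoid) loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The key point of the sketch is that no reward model or policy-gradient loop appears: the preference pair itself supplies the classification target, which is what the lecture means by DPO optimizing the language model directly from preference data.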