In this podcast, we explore two approaches to aligning large language models (LLMs) with human preferences: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). RLHF follows a three-step process: unsupervised pre-training, supervised fine-tuning, and reinforcement learning guided by a reward model trained on human feedback. DPO takes a more direct route, optimizing the LLM's response probabilities on human preference data without training an explicit reward model. While both strategies aim to improve alignment with human values, the conversation addresses reward hacking, in which models exploit flaws in the reward signal to score highly without genuinely satisfying human intent, and examines ongoing research on mitigating this problem in both RLHF and DPO, with particular focus on optimizer selection and data efficiency.
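For listeners who want a concrete picture of the "direct" optimization DPO performs, the sketch below shows one common formulation of its loss in PyTorch. The function name, argument names, beta value, and the toy log-probabilities are illustrative assumptions, not code from the episode; the idea is simply to widen the gap between the policy's implicit reward for the preferred response and for the rejected one, relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the trainable
    policy (or the frozen reference model) assigns to the human-preferred
    ("chosen") or dispreferred ("rejected") response for each prompt.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
if __name__ == "__main__":
    policy_chosen = torch.tensor([-12.3, -8.7])
    policy_rejected = torch.tensor([-14.1, -9.5])
    ref_chosen = torch.tensor([-12.5, -8.9])
    ref_rejected = torch.tensor([-13.8, -9.2])
    print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Because the preference signal enters the loss directly, there is no separate reward model to train or to game, which is part of why the episode contrasts DPO's failure modes with classic RLHF reward hacking.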