This episode explores the evolution and future of reinforcement learning from human feedback (RLHF) in large language model (LLM) development. Against the backdrop of rapid progress in LLMs, the discussion highlights the growing importance of post-training fine-tuning, particularly with RLHF, rather than relying on pre-training alone. The guest then introduces Direct Preference Optimization (DPO), a simpler and more scalable alternative to traditional RLHF methods such as Proximal Policy Optimization (PPO), and describes its impact on the field, recounting the story of the Zephyr model as the pivotal moment when DPO gained widespread adoption. The conversation then turns to the challenges of evaluating RLHF methods, which motivated RewardBench, a benchmark for evaluating reward models. Finally, the discussion contrasts how academia and industry approach RLHF, emphasizing the need for more robust datasets and for exploring online RLHF methods to further improve LLM performance and alignment.
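For context on why DPO is often described as simpler than PPO-based RLHF: it replaces the separately trained reward model and the reinforcement-learning loop with a single classification-style loss over preference pairs. Below is a minimal sketch of that loss in PyTorch; the function and variable names are illustrative and not taken from the episode.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Sketch of the DPO objective over a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of the chosen or
    rejected response under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much the policy's log-prob moved relative to the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the loss is computed directly from log-probabilities of preferred and rejected completions, no separate reward model or on-policy rollout is needed, which is the scalability advantage discussed in the episode.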