The lecture focuses on post-training techniques for large language models, covering the transition from pre-training to making models useful and safe. It addresses two main areas: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The discussion covers the importance of high-quality data, the challenges of collecting it, and the trade-offs between capabilities and safety. It also takes up algorithmic questions, such as how to effectively use different types of data and how to scale up training, and introduces methods like Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). Finally, it explores the nuances of instruction tuning, including the potential for models to hallucinate and the impact of data length and style on model behavior.
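Since the lecture only names DPO rather than walking through it, here is a minimal sketch of the standard DPO objective to make the method concrete. The function name and the toy log-probability values are illustrative, not from the lecture; the inputs are assumed to be per-example summed token log-probabilities from the trained policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    chosen or rejected response under the policy or the reference model.
    """
    # Implicit reward: scaled log-ratio of policy to reference for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin pushes chosen above rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities (illustrative only):
lp_c = torch.tensor([-12.3, -9.8])
lp_r = torch.tensor([-14.1, -11.0])
ref_c = torch.tensor([-12.0, -10.1])
ref_r = torch.tensor([-13.5, -10.7])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r))
```

Unlike PPO-based RLHF, this objective needs no separate reward model or sampling loop, which is part of why the lecture contrasts the two approaches.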