Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM tuning | Stanford Online

The lecture covers LLM tuning, focusing on preference tuning to align models with human preferences. It explains why a third tuning step is needed beyond pre-training and supervised fine-tuning (SFT): creating high-quality, unbiased datasets for the SFT stage is difficult and time-consuming. The discussion covers data collection methods, including pointwise, pairwise, and listwise approaches, with a focus on pairwise preference data. The lecture introduces Reinforcement Learning from Human Feedback (RLHF) and its two stages: training a reward model to distinguish good from bad outputs, then using its rewards to align the model with preferences. It also covers PPO (Proximal Policy Optimization), KL divergence, reward hacking and other challenges in RL-based approaches, and introduces DPO (Direct Preference Optimization) as a supervised alternative.
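As a rough illustration of the two objectives named in the summary, the sketch below writes out the pairwise Bradley-Terry reward-model loss and the DPO loss in their commonly published forms. It is not taken from the lecture itself: the function names, the scalar inputs, and the default `beta=0.1` are assumptions made for this example, and real training code would operate on batches of model outputs rather than single floats.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. log(1 + exp(-x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise reward-model loss under the Bradley-Terry model.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    the loss is the negative log of that probability.
    """
    return neg_log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, controls deviation from the reference
) -> float:
    """DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model (hypothetical scalars here).
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


if __name__ == "__main__":
    # Reward model already ranks the chosen answer higher: small loss.
    print(bradley_terry_loss(2.0, 0.0))
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

In both functions the loss shrinks as the preferred output is scored further above the rejected one. DPO applies the same Bradley-Terry form directly to policy log-probability ratios, which is why it needs no separate reward model or RL loop.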

Outlines

00:05 Transitioning from Pre-training and Supervised Fine-Tuning to LLM Alignment

05:01 Collecting Pairwise Human Preference Data for Model Alignment

18:22 Training Reward Models Using the Bradley-Terry Probabilistic Formulation

46:07 Optimizing Policies with PPO and Mitigating Reward Hacking

1:22:47 Direct Preference Optimization as a Simplified Alternative to RLHF