YouTube, 14 Nov 2025
1h 47m

Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 5 - LLM tuning

Stanford Online

The lecture covers LLM tuning, focusing on preference tuning, which aligns models with human preferences. It explains why a third tuning step is needed after pre-training and supervised fine-tuning (SFT), pointing to how difficult and time-consuming it is to create high-quality, unbiased datasets for the SFT stage. It then surveys ways of collecting preference data (pointwise, pairwise, and listwise), with a focus on pairwise preference data. The lecture introduces Reinforcement Learning from Human Feedback (RLHF) and its two stages: learning to distinguish good outputs from bad ones, then using the resulting rewards to align the model with those preferences. It also covers PPO (Proximal Policy Optimization), KL divergence, and the challenges of RL-based approaches, and introduces DPO (Direct Preference Optimization) as a supervised alternative.
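To make the pairwise idea concrete, here is a minimal Python sketch of two losses commonly associated with these topics: a Bradley-Terry style pairwise loss for the "distinguish good from bad" stage, and the DPO loss as usually stated in the literature. The summary itself gives no formulas, so the function names, the scalar inputs, and the default `beta=0.1` are illustrative assumptions and not taken from the lecture.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss for one preference pair (hypothetical helper).

    r_chosen / r_rejected are scalar reward scores for the preferred and
    the rejected response to the same prompt. The loss shrinks as the
    preferred response is scored higher than the rejected one.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. No explicit reward model is used:
    the log-ratio against the reference plays the role of the reward.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))
```

In both functions the loss equals log 2 when the pair is scored identically (or when the policy matches the reference) and falls as the preferred response gains an advantage. A real training setup would compute these over batches of token-level log-probabilities with an autodiff framework.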
