The lecture explores LLM tuning, focusing on preference tuning to align models with human preferences. It addresses why a third tuning step is needed beyond pre-training and supervised fine-tuning, highlighting the difficulty and time involved in creating high-quality, unbiased datasets for the SFT stage. The discussion covers data collection methods, including pointwise, pairwise, and listwise approaches, with a focus on pairwise preference data. The lecture introduces Reinforcement Learning from Human Feedback (RLHF) and its two stages: training a reward model to distinguish good from bad outputs, then using that reward signal to align the model with preferences. It also covers PPO (Proximal Policy Optimization), KL divergence, and challenges of RL-based approaches, and introduces DPO (Direct Preference Optimization) as a supervised alternative.
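For reference, the standard formulations behind the two approaches mentioned above can be written as follows (notation is not taken from the lecture itself): $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ the frozen SFT reference model, $r_\phi$ the learned reward model, $\beta$ the KL coefficient, and $(x, y_w, y_l)$ a prompt with a preferred and a dispreferred response from the pairwise preference data.

```latex
% KL-regularized RLHF objective (optimized with PPO in practice):
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]

% DPO loss: the same objective reparameterized as a supervised loss
% over pairwise preferences, with no explicit reward model or RL loop:
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

The KL term keeps the tuned policy close to the SFT model, which is also why DPO's loss depends only on log-probability ratios against that same reference model.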