Archit Sharma discusses preference optimization in large language models, covering instruction fine-tuning, Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO). The lecture addresses how to turn pre-trained language models into helpful assistants by aligning them with human intent. Instruction fine-tuning teaches formatting and style but is limited by the cost of collecting demonstration data and by the lack of a single correct answer for creative tasks. RLHF trains a reward model from human preference data and then maximizes expected reward with policy gradient methods, while DPO simplifies the pipeline by optimizing the language model's parameters directly on preference data, framing alignment as a binary classification problem. The discussion also covers limitations of the current paradigm, including reward hacking and the difficulty of collecting representative preference data, and touches on future directions such as personalized language models and verifiable rewards.
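To make the "binary classification" framing of DPO concrete, here is a minimal sketch of the DPO loss in PyTorch. It assumes each input is a batch of summed per-sequence log-probabilities from the policy and a frozen reference model for the chosen and rejected completions; the function name, argument names, and the default `beta` are illustrative choices, not taken from the lecture.

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Sketch of the Direct Preference Optimization loss.

    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of a completion under the policy or the frozen
    reference model. `beta` scales the implicit KL penalty toward
    the reference model.
    """
    # Implicit rewards: beta-scaled log-probability ratios against the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification over preference pairs: push the chosen completion's
    # implicit reward above the rejected one's via a logistic (sigmoid) loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The key point of the sketch is that no reward model or policy-gradient loop appears: the preference pair itself supplies the classification target, which is what the lecture means by DPO optimizing the language model directly from preference data.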