This episode explores the evolution of large language models (LLMs) from pre-trained next-token predictors to sophisticated conversational systems like ChatGPT. Against the backdrop of rapidly growing compute and training data (from billions to trillions of tokens), the lecture details how models trained solely on next-token prediction incidentally acquire complex reasoning and problem-solving abilities. The discussion then turns to zero-shot and few-shot learning, in which models perform tasks given few or no explicit training examples; for instance, appending a cue like "TL;DR:" to an article can elicit a summary, as sketched below. Because prompting alone has clear limitations, the lecture next covers instruction fine-tuning, in which models are trained on diverse instruction-output pairs spanning many tasks, markedly improving their alignment with user intent. Finally, it examines Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO), methods that optimize models directly for human preferences and yield more natural, helpful responses; InstructGPT and ChatGPT exemplify this stage. Looking ahead, the focus is on refining these techniques, in particular DPO's potential to make preference tuning more accessible to open-source development, while addressing challenges such as reward hacking and biases inherent in the training data.
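
The zero-shot summarization trick mentioned above can be illustrated with a minimal sketch using the Hugging Face `transformers` library; the choice of the `gpt2` checkpoint and the generation settings are illustrative assumptions, not something specified in the lecture:

```python
from transformers import pipeline

# Load a small pre-trained causal LM (any base LM would do; gpt2 is just an example).
generator = pipeline("text-generation", model="gpt2")

article = "Researchers trained a large language model on web text and observed that ..."
# Appending "TL;DR:" cues the model to continue with a summary, with no fine-tuning.
prompt = article + "\nTL;DR:"

output = generator(prompt, max_new_tokens=60, do_sample=False)[0]["generated_text"]
print(output[len(prompt):])  # the text generated after the TL;DR cue
```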
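Instruction fine-tuning, in contrast, reformats many tasks into instruction-response pairs and continues training with the ordinary next-token objective. The template and example strings below are a hypothetical illustration of that data format, not any particular dataset's schema:

```python
def format_example(instruction: str, response: str) -> str:
    # Illustrative template; real instruction-tuning datasets use various formats.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

examples = [
    format_example("Translate to French: 'Good morning'", "Bonjour"),
    format_example("Summarize: The cat sat on the mat all afternoon.", "A cat rested on a mat."),
]

# Each formatted string would then be tokenized and used for standard
# next-token-prediction fine-tuning of the pre-trained model.
```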
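Finally, DPO replaces RLHF's separate reward model and RL loop with a single classification-style loss over preference pairs. The following PyTorch sketch shows that loss given per-response log-probabilities; the function name, batch values, and beta=0.1 are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed log-probs of the preferred (chosen) and
    dispreferred (rejected) responses under the policy and a frozen
    reference model."""
    # Log-ratio of policy vs. reference for each response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Push the chosen response's log-ratio above the rejected one's.
    margins = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margins).mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
```

Because this is just a differentiable loss over model log-probabilities, it can be dropped into an ordinary fine-tuning loop, which is a large part of why the lecture highlights DPO's appeal for open-source work.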