YouTube | 02 Jun 2025
1h 18m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 11: Scaling laws 2

Stanford Online

In this lecture, the speaker covers scaling laws for large language models, with an emphasis on how they are applied in practice when building models. The discussion works through case studies of Cerebras-GPT, MiniCPM, and DeepSeek, highlighting each model's scaling strategy and the lessons learned. A significant portion of the lecture is dedicated to μP (maximal update parametrization), explaining its mathematical underpinnings and how it keeps hyperparameters stable across model scales. The speaker also touches on more recent models such as Llama 3, Hunyuan-Large, and MiniMax-01, and introduces the WSD (warmup-stable-decay) learning rate schedule for efficient data scaling. The lecture aims to give listeners a working understanding of the scaling strategies used in real-world language models.

Outline

Part 1: Introduction to Scaling Laws

Part 2: Scaling in Recent Models

Part 3: Understanding μP

Part 4: Conclusion
