In this lecture, the speaker delves into scaling laws for large language models, emphasizing their practical application in model building. The discussion covers case studies of models such as Cerebras-GPT, MiniCPM, and DeepSeek, highlighting their scaling strategies and the lessons learned from each. A significant portion of the lecture is devoted to the μP (Maximal Update Parametrization) method, explaining its mathematical underpinnings and how it keeps optimal hyperparameters stable across model scales. The speaker also touches on more recent models such as Llama 3, Hunyuan-Large, and MiniMax-01, and introduces the WSD (warmup-stable-decay) learning rate schedule for efficient data scaling. The lecture aims to equip listeners with a comprehensive understanding of scaling strategies used in real-world language models.
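The WSD schedule mentioned above can be sketched in a few lines: a linear warmup, a long constant ("stable") plateau, and a short final decay. This is a minimal illustration only; the function name, parameters, and the linear decay shape are assumptions for this sketch, and real implementations (e.g. as described for MiniCPM) may use a different decay curve.

```python
def wsd_lr(step, max_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay learning rate schedule (illustrative sketch).

    Phases:
      1. Warmup: learning rate rises linearly from ~0 to max_lr.
      2. Stable: learning rate stays flat at max_lr (most of training).
      3. Decay:  learning rate anneals from max_lr down to min_lr.
    """
    if step < warmup_steps:
        # Linear warmup toward max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Constant plateau; training can be extended cheaply in this phase.
        return max_lr
    # Linear decay over decay_steps, clamped at min_lr afterward.
    t = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return max_lr + (min_lr - max_lr) * t
```

The appeal of this shape for data scaling is that the plateau phase can simply be lengthened to train on more tokens, with the decay phase re-run from a plateau checkpoint, rather than committing to a total step count up front as a cosine schedule requires.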