The podcast focuses on scaling laws in machine learning, particularly for large language models (LLMs). It begins by framing the challenge of building the best open-source LLM with limited resources, emphasizing the need to innovate rather than simply copy existing models. The discussion covers the history and background of scaling laws, highlighting their grounded nature and their evolution from theoretical machine learning to empirical practice. It then explores data scaling, model scaling, and the interplay between data, model size, and compute, including the Chinchilla scaling laws. Finally, it addresses practical engineering decisions such as hyperparameter tuning, architecture selection, and resource allocation, along with the trade-off between model size and dataset size.
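As a rough illustration of the compute/data/model trade-off the episode discusses, here is a minimal sketch of the widely cited Chinchilla-style rule of thumb. It assumes the common approximations that training compute is about 6·N·D FLOPs and that a compute-optimal model sees roughly 20 tokens per parameter; these constants and the function name are standard reported values and illustrative choices, not figures taken from the episode.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into model size and token count.

    Assumes C ~= 6 * N * D training FLOPs and the Chinchilla-style
    rule of thumb D ~= tokens_per_param * N. Solving the two relations
    gives N and D as functions of the compute budget C alone:
        N = sqrt(C / (6 * k)),  D = k * N,  with k = tokens_per_param.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Illustrative budget: ~6e23 FLOPs gives roughly a 70B-parameter
    # model trained on ~1.4T tokens, in line with the Chinchilla setup.
    n, d = chinchilla_optimal(6e23)
    print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

The point of the sketch is the qualitative behavior: under a fixed compute budget, parameters and tokens should grow together, so doubling compute should roughly go half into a bigger model and half into more data rather than all into model size.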