The lecture focuses on the architecture and training of Large Language Models (LLMs), diving into details often overlooked in other courses. It begins with a recap of transformers, contrasting the standard and modern variants, and then shifts to a data-driven analysis of LLM architectures, examining what has changed and what has stayed the same across various models. The discussion covers architecture variations such as pre-norm versus post-norm, the shift to RMSNorm, and the impact of removing bias terms. It also explores different activation functions, particularly the rise of gated linear units (GLUs), and the use of parallel layers. The lecture then addresses hyperparameter choices, including feed-forward size, the ratio between model dimension and head dimension, aspect ratio, and vocabulary size. It also touches on the role of regularization, specifically weight decay, and concludes by examining stability tricks, variations in attention heads such as grouped-query attention (GQA) and multi-query attention (MQA), and techniques for handling longer context windows.
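
To make the architecture variations concrete, below is a minimal PyTorch sketch of a pre-norm transformer block that combines RMSNorm (no mean subtraction, no bias), bias-free linear layers, and a gated linear unit feed-forward (the SwiGLU variant). The class names `RMSNorm`, `SwiGLUFeedForward`, and `PreNormBlock` are illustrative and not taken from any particular model's codebase; this is a sketch of the general pattern, not a specific implementation discussed in the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the activations,
    with a learned gain but no mean subtraction and no bias term."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLUFeedForward(nn.Module):
    """Gated linear unit feed-forward (SwiGLU variant): the hidden activation
    is the elementwise product of a SiLU-gated projection and a plain linear
    projection. Bias terms are omitted, as in many recent LLMs."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class PreNormBlock(nn.Module):
    """Pre-norm transformer block: normalization is applied before each
    sublayer, so the residual stream itself is never normalized."""
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, bias=False, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLUFeedForward(dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sublayer with pre-normalization and residual connection.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Gated feed-forward sublayer, also pre-normalized.
        return x + self.ffn(self.ffn_norm(x))
```

As a rough usage note under the same assumptions: because the GLU feed-forward has three weight matrices instead of two, its hidden dimension is often set near 8/3 of the model dimension rather than the classic 4x, keeping the parameter count of the block roughly comparable.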