Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 3 - Transformers & Large Language Models | Stanford Online

The lecture introduces Large Language Models (LLMs), defining them as large-scale language models that predict token sequences and emphasizing their scale in parameters, training data, and compute. It distinguishes LLMs from earlier models like BERT, highlights the decoder-only architecture, and introduces Mixture of Experts (MOE), which improves computational efficiency by activating only a subset of the model's parameters. The discussion covers dense versus sparse MOEs and techniques to prevent routing collapse during training. It also explores response generation, contrasting greedy decoding and beam search with sampling methods such as top-K and top-P sampling, and explains how temperature affects output diversity. The lecture concludes with strategies to improve LLM inference efficiency, including KV caching, grouped-query attention, and speculative decoding.
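The decoding strategies named in the summary can be illustrated with a small sketch. This is not code from the lecture: the four-token vocabulary, the logit values, and the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities. Temperature must be > 0;
    values below 1 sharpen the distribution, values above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Hypothetical next-token logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

# Greedy decoding: always pick the highest-scoring token.
greedy = max(range(len(logits)), key=lambda i: logits[i])

# Sampling: apply temperature, restrict the candidate set, then draw.
probs = softmax(logits, temperature=0.7)
rng = random.Random(0)  # fixed seed so runs are repeatable
token_k = sample(top_k_filter(probs, k=2), rng)
token_p = sample(top_p_filter(probs, p=0.9), rng)
```

Greedy decoding is deterministic, while the two sampled tokens depend on the seed. Lowering the temperature toward zero concentrates probability on the top token, so sampling behaves more and more like greedy decoding.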

Outline

Part 1: Introduction, LLM Fundamentals

Part 2: Mixture of Experts (MOE)

Part 3: Decoding, Sampling Strategies

Part 4: Prompt Engineering, In-Context Learning

Part 5: Efficient Inference, Optimization

Part 1: Introduction, LLM Fundamentals

Course Announcements: Slides Available Before Class on Thursday Evenings

Recap of Transformer Models: Encoder-Decoder, Encoder-Only, and Decoder-Only

Defining Large Language Models (LLMs): Size, Data, Compute, and Decoder-Only Architecture

Part 2: Mixture of Experts (MOE)

Mixture of Experts (MOE): Activating Subsets of Parameters for Efficient Computation

Dense vs. Sparse MOEs and the Routing Collapse Challenge

Mitigating Routing Collapse with Loss Function Adjustments and Noisy Gating

Differentiability, Parameter Scaling, and Attention Heads in MOEs

Gate-Driven Expert Selection and Visualization of Token Routing

Part 3: Decoding, Sampling Strategies

Response Generation: From Text-In, Text-Out to Next Token Prediction

Beam Search and Sampling Methods for Token Selection

Top-K and Top-P Sampling and the Softmax Layer

Temperature's Impact on Output Probabilities and Deterministic Outputs

Guided Decoding for Specific Output Formats

Part 4: Prompt Engineering, In-Context Learning

Prompting Strategies: Context Length and Context Rot

Structuring Prompts: Context, Query, Instructions, and Constraints

In-Context Learning: Zero-Shot vs. Few-Shot Approaches

Chain of Thought (COT): Improving Response Quality with Reasoning

Self-Consistency: Sampling and Majority Voting for Robust Answers

Part 5: Efficient Inference, Optimization

Efficient Inference: Exact vs. Approximate Techniques

Key-Value (KV) Caching: Reusing Past Computations for Efficiency

Paged Attention: Managing Memory Fragmentation for Efficient Inference

Multi-Head Latent Attention: Compacting Key-Value Representations

Speculative Decoding: Using a Smaller Model to Aid Generation

Multi-Token Prediction: Embedding the Draft Model

Part 1: Introduction, LLM Fundamentals

00:05 Course Announcements: Slides Available Before Class on Thursday Evenings

00:51 Recap of Transformer Models: Encoder-Decoder, Encoder-Only, and Decoder-Only

03:44 Defining Large Language Models (LLMs): Size, Data, Compute, and Decoder-Only Architecture

Part 2: Mixture of Experts (MOE)

07:31 Mixture of Experts (MOE): Activating Subsets of Parameters for Efficient Computation

12:26 Dense vs. Sparse MOEs and the Routing Collapse Challenge

21:27 Mitigating Routing Collapse with Loss Function Adjustments and Noisy Gating

25:05 Differentiability, Parameter Scaling, and Attention Heads in MOEs

30:34 Gate-Driven Expert Selection and Visualization of Token Routing

Part 3: Decoding, Sampling Strategies

35:52 Response Generation: From Text-In, Text-Out to Next Token Prediction

41:21 Beam Search and Sampling Methods for Token Selection

47:25 Top-K and Top-P Sampling and the Softmax Layer

53:12 Temperature's Impact on Output Probabilities and Deterministic Outputs

1:04:08 Guided Decoding for Specific Output Formats

Part 4: Prompt Engineering, In-Context Learning

1:07:08 Prompting Strategies: Context Length and Context Rot

1:11:41 Structuring Prompts: Context, Query, Instructions, and Constraints

1:14:06 In-Context Learning: Zero-Shot vs. Few-Shot Approaches

1:18:31 Chain of Thought (COT): Improving Response Quality with Reasoning

1:21:45 Self-Consistency: Sampling and Majority Voting for Robust Answers

Part 5: Efficient Inference, Optimization

1:24:51 Efficient Inference: Exact vs. Approximate Techniques

1:27:29 Key-Value (KV) Caching: Reusing Past Computations for Efficiency

1:32:43 Paged Attention: Managing Memory Fragmentation for Efficient Inference

1:36:25 Multi-Head Latent Attention: Compacting Key-Value Representations

1:41:21 Speculative Decoding: Using a Smaller Model to Aid Generation

1:45:00 Multi-Token Prediction: Embedding the Draft Model