Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 4: Mixture of experts | Stanford Online

The lecture discusses Mixture of Experts (MOE) architectures in modern high-performance systems, highlighting their advantages over dense architectures in parameter efficiency and performance. It covers the basic components of MOEs, focusing on sparsely activated experts that take the place of the MLP, and explains how MOEs add parameters without increasing the compute spent per token. The lecture also explores different routing mechanisms, training strategies, and stability concerns associated with MOEs, referencing key papers and implementations such as DeepSeek v3. It further addresses practical aspects such as expert-level and device-level balancing, communication costs, and the stochasticity that token dropping can induce, and concludes with an overview of the DeepSeek MOE architecture and its evolution.
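To make the sparse-activation idea in the summary concrete, here is a minimal sketch of token-choice top-k routing and of the gap between total and per-token active parameters. It is an illustration only, not code from the lecture: the function names, the renormalisation of gate weights over the selected experts, and the layer sizes are all assumptions chosen for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Token-choice routing for one token (hypothetical helper).

    Returns [(expert_index, gate_weight), ...] for the k highest-scoring experts.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: gate weights are renormalised over the selected experts.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_mlp_param_counts(d_model, d_ff, num_experts, k):
    """Count MLP parameters, assuming two weight matrices per expert.

    Biases and the router's own parameters are ignored for simplicity.
    """
    per_expert = 2 * d_model * d_ff
    return {
        "total": num_experts * per_expert,   # parameters stored
        "active_per_token": k * per_expert,  # parameters used by each token
    }


# One token, four experts, two experts active per token.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)

# Illustrative sizes only: 8 experts, 2 active per token.
counts = moe_mlp_param_counts(d_model=512, d_ff=2048, num_experts=8, k=2)
```

With these illustrative sizes the layer stores four times as many MLP parameters as any single token uses, which is the sense in which an MOE gains parameters without a matching increase in per-token compute.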

Outlines

Part 1: MOE Introduction and Routing

Part 2: Expert Configurations and Training

Part 3: Systems and Architectures

Part 1: MOE Introduction and Routing

00:04 Introduction to Mixture of Experts (MOEs)

08:04 MOEs and Parallelism: Advantages and Complexities

15:26 Routing Mechanisms in MOEs: Token Choice vs. Expert Choice

23:40 Top-K Routing in Detail and Router Parameterization

Part 2: Expert Configurations and Training

30:31 Shared and Fine-Grained Experts

37:08 Common MOE Configurations and Training Challenges

44:06 Training MOEs: Stochastic Approximations and Loss Balancing

52:10 DeepSeek's Innovations in Loss Balancing and the Importance of Expert Balancing

Part 3: Systems and Architectures

59:42 Systems Aspects of MOEs: Expert Parallelism and Stability Concerns

1:07:11 Fine-Tuning MOEs, Upcycling, and DeepSeek MOE Architecture

1:16:17 DeepSeek v3 Architecture and Non-MOE Components