YouTube, 24 Apr 2025
1h 22m

Stanford CS336 Language Modeling from Scratch | Spring 2025 | Lecture 4: Mixture of experts

Stanford Online

The lecture discusses Mixture of Experts (MoE) architectures, which many modern high-performance language models use, and highlights their advantage over dense architectures in parameter efficiency and performance. It covers the basic components of an MoE, focusing on sparsely activated experts that replace the MLP layers, and explains how MoEs add parameters without increasing per-token compute: each token is routed to only a few experts, so only a fraction of the parameters is active for any given token. The lecture also explores routing mechanisms, training strategies, and stability concerns, referencing key papers and implementations such as DeepSeek V3. It then addresses practical aspects, including expert-level and device-level load balancing, communication costs, and the stochasticity that token dropping can introduce, and concludes with an overview of the DeepSeekMoE architecture and its evolution.
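As a rough illustration of the sparse activation described above, here is a minimal sketch of top-k routing for a single token, in plain Python. It is not taken from the lecture: the names `moe_forward`, `matvec`, `router`, and `experts` are hypothetical, the experts are assumed to be two-layer ReLU MLPs, and the gates are assumed to be softmax scores renormalized over the selected experts (real systems differ on where the softmax and normalization happen).

```python
import math

def matvec(w, x):
    # w is a list of rows (out_dim x in_dim); x is a list of length in_dim.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, router, experts, k=2):
    """Route one token vector x to its top-k experts (illustrative only).

    router:  n_experts x d_model matrix of routing weights (assumed linear router).
    experts: list of (w_in, w_out) pairs, with w_in of shape d_ff x d_model
             and w_out of shape d_model x d_ff (assumed ReLU MLP experts).
    """
    scores = softmax(matvec(router, x))
    # Keep the k highest-scoring experts; all others are skipped entirely.
    chosen = sorted(range(len(experts)), key=lambda e: -scores[e])[:k]
    norm = sum(scores[e] for e in chosen)
    y = [0.0] * len(x)
    for e in chosen:
        w_in, w_out = experts[e]
        h = [max(0.0, v) for v in matvec(w_in, x)]  # ReLU hidden layer
        out = matvec(w_out, h)
        gate = scores[e] / norm  # assumption: renormalize over chosen experts
        y = [yi + gate * oi for yi, oi in zip(y, out)]
    return y, chosen
```

Only the `k` chosen experts run their matrix multiplies, so per-token compute scales with `k` while the total parameter count scales with `len(experts)`.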

Outline

Part 1: MoE Introduction and Routing

Part 2: Expert Configurations and Training

Part 3: Systems and Architectures