The podcast discusses Mixture of Experts (MoE) architectures in modern high-performance systems, highlighting their advantage over dense architectures in parameter efficiency and performance. It covers the basic components of MoEs, focusing on sparsely activated experts within the MLP layers, and explains how MoEs add parameters without increasing per-token computational cost. The episode also explores different routing mechanisms, training strategies, and stability concerns associated with MoEs, referencing key papers and implementations such as DeepSeek-V3. It further addresses practical aspects such as expert- and device-level load balancing, communication costs, and the stochasticity induced by token dropping, concluding with an overview of the DeepSeekMoE architecture and its evolution.
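To make the sparse-activation idea concrete, here is a minimal sketch of an MoE layer with top-k routing, assuming a PyTorch-style setup. The names (`TopKMoE`, `num_experts`, `top_k`) are illustrative and not taken from the episode or from any particular implementation such as DeepSeek-V3.

```python
# Minimal sketch (assumed, not from the episode): a top-k routed MoE layer.
# Total parameters grow with num_experts, but each token only runs through
# top_k expert MLPs, so per-token compute stays roughly constant.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                             # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize gate weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE(d_model=16, d_hidden=64)
    tokens = torch.randn(10, 16)
    print(layer(tokens).shape)  # torch.Size([10, 16])
```

With `num_experts=8` and `top_k=2`, the layer holds roughly 8x the MLP parameters of a dense block while each token activates only 2 experts; real systems add load-balancing losses and capacity limits (token dropping) on top of this basic routing.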