In this monologue, Lei Zhang, a research scientist at ByteDance, discusses methods for improving the runtime reliability of large language model training. The talk highlights the challenges of debugging faults in large-scale GPU training environments, where noisy metrics, fault propagation, and task-dependent anomalies make faulty machines hard to pinpoint. Zhang introduces Minder, a fault detection system that combines VAE-based parametric models with similarity-based checks across machines to accurately identify the faulty one. The talk then addresses gray failures in collective communication libraries (CCLs) and introduces Mycroft, a tracing system that achieves CCL-level observability by tracing dependencies among communication operations, enabling root cause analysis at runtime and ultimately improving robustness and reducing wasted compute in ML infrastructure.
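The similarity-based check in Minder rests on the intuition that machines running the same training job should exhibit nearly identical metric profiles, so a machine whose metrics sit far from all of its peers is a fault candidate. Below is a minimal sketch of that idea, assuming simple per-machine metric vectors; the function name, the robust median/MAD normalization, and the 2x-median cutoff are illustrative choices, not Minder's actual implementation.

```python
import numpy as np

def flag_outlier_machines(metrics: np.ndarray) -> list[int]:
    """metrics: (num_machines, num_features) per-machine metric snapshot,
    e.g. GPU utilization, NIC throughput, CPU load over one window."""
    # Robust per-feature normalization (median/MAD) so a single faulty
    # machine cannot inflate the scale used to judge everyone else.
    med = np.median(metrics, axis=0)
    mad = np.median(np.abs(metrics - med), axis=0) + 1e-8
    z = (metrics - med) / mad
    # Average distance from each machine to every other machine.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    avg_dist = dists.sum(axis=1) / (len(metrics) - 1)
    # A machine dissimilar to all peers (here: more than 2x the median
    # dissimilarity) is flagged as likely faulty.
    cutoff = 2.0 * np.median(avg_dist)
    return [i for i, d in enumerate(avg_dist) if d > cutoff]

# Toy demo: 8 machines, machine 5 has a collapsed throughput metric.
rng = np.random.default_rng(0)
fleet = rng.normal(1.0, 0.05, size=(8, 4))
fleet[5, 2] = 0.2
print(flag_outlier_machines(fleet))  # expected: [5]
```

Comparing machines against each other, rather than against a fixed threshold, is what makes this kind of check task-independent: the same rule works whatever the job's normal utilization happens to be.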
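For the CCL-observability half of the talk, a rough sketch of why dependency tracing helps with root cause analysis: collectives only complete when every participating rank arrives, so recording when each rank enters and exits each collective call lets a hang be localized to the rank that broke the dependency chain. The event schema and localization rule below are hypothetical, not the tracing system's actual design.

```python
from dataclasses import dataclass

@dataclass
class CollectiveEvent:
    op_id: int      # sequence number of the collective (e.g., all-reduce #7)
    rank: int       # which GPU/process emitted the event
    entered: bool   # True once the rank called into the collective
    exited: bool    # True once the collective returned on this rank

def localize_hang(events: list[CollectiveEvent], world_size: int) -> str:
    # Group events by collective operation.
    by_op: dict[int, list[CollectiveEvent]] = {}
    for e in events:
        by_op.setdefault(e.op_id, []).append(e)
    # The earliest collective that not every rank completed is where
    # the hang sits; a rank that never entered it is the prime suspect.
    for op_id in sorted(by_op):
        entered = {e.rank for e in by_op[op_id] if e.entered}
        exited = {e.rank for e in by_op[op_id] if e.exited}
        if len(exited) < world_size:
            missing = sorted(set(range(world_size)) - entered)
            if missing:
                return f"op {op_id}: ranks {missing} never entered -> suspect them"
            stuck = sorted(set(range(world_size)) - exited)
            return f"op {op_id}: all ranks entered but {stuck} never exited"
    return "no hang detected"

# Toy demo: 4 ranks; ranks 0, 1, 3 block in all-reduce #7 because rank 2
# never reached the call.
events = [CollectiveEvent(7, r, entered=True, exited=False) for r in (0, 1, 3)]
print(localize_hang(events, world_size=4))  # expected: suspects rank 2
```

This is the sense in which gray failures need CCL-level visibility: from the outside, every rank just looks idle, and only the per-rank dependency trace distinguishes the victim ranks from the one that caused the stall.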