Lei Zhang from ByteDance discusses improving the runtime reliability of large language model (LLM) training, addressing the failure challenges that grow with the scale of these jobs. He introduces Minder, a fault detection system that combines unsupervised VAE-based parametric models with similarity-based checks across machines to identify faulty ones, reducing detection time by 99% compared to manual diagnosis.

Zhang also presents MyCraft, a tracing system for collective communication library (CCL)-level observability that targets gray failures by tracing dependencies in collective communication. MyCraft detects faults by monitoring a subset of GPUs and performs lightweight, dependency-driven root cause analysis; in fault-injection experiments it successfully detected and localized various failure types.
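The intuition behind the similarity-based check is that machines in a synchronized training job should exhibit near-identical runtime metrics, so a faulty machine stands out as the one whose metric vector deviates most from its peers. Below is a minimal, hypothetical sketch of that idea (not ByteDance's implementation); the function name, metric layout, and outlier threshold are illustrative assumptions.

```python
import numpy as np

def detect_faulty_machine(metrics: np.ndarray, threshold: float = 2.5):
    """Flag the machine whose metric vector deviates most from its peers.

    metrics: shape (num_machines, num_features), e.g. per-machine averages
    of GPU utilization, network throughput, and memory usage.
    Returns the index of the suspected faulty machine, or None if no
    machine is a clear outlier. (Illustrative sketch, not Minder itself.)
    """
    # Normalize each feature so no single metric dominates the distance.
    mu = metrics.mean(axis=0)
    sigma = metrics.std(axis=0) + 1e-8
    z = (metrics - mu) / sigma

    # Mean Euclidean distance from each machine to every other machine.
    diffs = z[:, None, :] - z[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    mean_dist = dists.sum(axis=1) / (len(metrics) - 1)

    # A machine is suspect if its mean distance to peers is an outlier
    # relative to the spread of mean distances across the fleet.
    score = (mean_dist - mean_dist.mean()) / (mean_dist.std() + 1e-8)
    suspect = int(score.argmax())
    return suspect if score[suspect] > threshold else None
```

A production system would track many more metrics over time windows and pair such peer comparison with learned per-metric models (as the talk's VAE-based approach suggests), but the peer-deviation step captures why the check needs no labeled failure data.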