In this monologue, Lei Zhang, a research scientist at ByteDance, discusses methods for improving the runtime reliability of large language model training. The talk highlights the challenges of debugging faults in large-scale GPU training environments, where noisy metrics, fault propagation, and task-dependent anomalies make faulty machines hard to pinpoint. Zhang introduces Minder, a fault detection system that combines VAE-based parametric models with similarity-based checks across machines to accurately identify the faulty one. The talk then addresses gray failures in collective communication libraries (CCLs) and introduces Mycroft, a tracing system that achieves CCL-level observability by tracing dependencies among communication operations, enabling root cause analysis at runtime and ultimately improving robustness and reducing wasted compute in ML infrastructure.
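The similarity-based check in Minder rests on the intuition that machines running the same training job should exhibit nearly identical metric profiles, so a machine whose metrics sit far from all of its peers is a fault candidate. Below is a minimal sketch of that idea, assuming simple per-machine metric vectors; the function name, the robust median/MAD normalization, and the 2x-median cutoff are illustrative choices, not Minder's actual implementation.

```python
import numpy as np

def flag_outlier_machines(metrics: np.ndarray) -> list[int]:
    """metrics: (num_machines, num_features) per-machine metric snapshot,
    e.g. GPU utilization, NIC throughput, CPU load over one window."""
    # Robust per-feature normalization (median/MAD) so a single faulty
    # machine cannot inflate the scale used to judge everyone else.
    med = np.median(metrics, axis=0)
    mad = np.median(np.abs(metrics - med), axis=0) + 1e-8
    z = (metrics - med) / mad
    # Average distance from each machine to every other machine.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    avg_dist = dists.sum(axis=1) / (len(metrics) - 1)
    # A machine dissimilar to all peers (here: more than 2x the median
    # dissimilarity) is flagged as likely faulty.
    cutoff = 2.0 * np.median(avg_dist)
    return [i for i, d in enumerate(avg_dist) if d > cutoff]

# Toy demo: 8 machines, machine 5 has a collapsed throughput metric.
rng = np.random.default_rng(0)
fleet = rng.normal(1.0, 0.05, size=(8, 4))
fleet[5, 2] = 0.2
print(flag_outlier_machines(fleet))  # expected: [5]
```

Comparing machines against each other, rather than against a fixed threshold, is what makes this kind of check task-independent: the same rule works whatever the job's normal utilization happens to be.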
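For the CCL-observability half of the talk, a rough sketch of why dependency tracing helps with root cause analysis: collectives only complete when every participating rank arrives, so recording when each rank enters and exits each collective call lets a hang be localized to the rank that broke the dependency chain. The event schema and localization rule below are hypothetical, not the tracing system's actual design.

```python
from dataclasses import dataclass

@dataclass
class CollectiveEvent:
    op_id: int      # sequence number of the collective (e.g., all-reduce #7)
    rank: int       # which GPU/process emitted the event
    entered: bool   # True once the rank called into the collective
    exited: bool    # True once the collective returned on this rank

def localize_hang(events: list[CollectiveEvent], world_size: int) -> str:
    # Group events by collective operation.
    by_op: dict[int, list[CollectiveEvent]] = {}
    for e in events:
        by_op.setdefault(e.op_id, []).append(e)
    # The earliest collective that not every rank completed is where
    # the hang sits; a rank that never entered it is the prime suspect.
    for op_id in sorted(by_op):
        entered = {e.rank for e in by_op[op_id] if e.entered}
        exited = {e.rank for e in by_op[op_id] if e.exited}
        if len(exited) < world_size:
            missing = sorted(set(range(world_size)) - entered)
            if missing:
                return f"op {op_id}: ranks {missing} never entered -> suspect them"
            stuck = sorted(set(range(world_size)) - exited)
            return f"op {op_id}: all ranks entered but {stuck} never exited"
    return "no hang detected"

# Toy demo: 4 ranks; ranks 0, 1, 3 block in all-reduce #7 because rank 2
# never reached the call.
events = [CollectiveEvent(7, r, entered=True, exited=False) for r in (0, 1, 3)]
print(localize_hang(events, world_size=4))  # expected: suspects rank 2
```

This is the sense in which gray failures need CCL-level visibility: from the outside, every rank just looks idle, and only the per-rank dependency trace distinguishes the victim ranks from the one that caused the stall.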