Lei Zhang from ByteDance discusses improving the runtime reliability of large language model (LLM) training, addressing the failure challenges that grow with the scale of these jobs. He introduces Minder, a fault detection system that combines unsupervised VAE-based parametric models with similarity-based checks across machines to identify faulty ones, reducing detection time by 99% compared to manual diagnosis.

Zhang also presents MyCraft, a tracing system for collective communication library (CCL)-level observability that targets gray failures by tracing dependencies in collective communication. MyCraft detects faults by monitoring a subset of GPUs and performs lightweight, dependency-driven root cause analysis; in fault-injection experiments it successfully detected and localized various failure types.
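The intuition behind the similarity-based check is that machines in a synchronized training job should exhibit near-identical runtime metrics, so a faulty machine stands out as the one whose metric vector deviates most from its peers. Below is a minimal, hypothetical sketch of that idea (not ByteDance's implementation); the function name, metric layout, and outlier threshold are illustrative assumptions.

```python
import numpy as np

def detect_faulty_machine(metrics: np.ndarray, threshold: float = 2.5):
    """Flag the machine whose metric vector deviates most from its peers.

    metrics: shape (num_machines, num_features), e.g. per-machine averages
    of GPU utilization, network throughput, and memory usage.
    Returns the index of the suspected faulty machine, or None if no
    machine is a clear outlier. (Illustrative sketch, not Minder itself.)
    """
    # Normalize each feature so no single metric dominates the distance.
    mu = metrics.mean(axis=0)
    sigma = metrics.std(axis=0) + 1e-8
    z = (metrics - mu) / sigma

    # Mean Euclidean distance from each machine to every other machine.
    diffs = z[:, None, :] - z[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    mean_dist = dists.sum(axis=1) / (len(metrics) - 1)

    # A machine is suspect if its mean distance to peers is an outlier
    # relative to the spread of mean distances across the fleet.
    score = (mean_dist - mean_dist.mean()) / (mean_dist.std() + 1e-8)
    suspect = int(score.argmax())
    return suspect if score[suspect] > threshold else None
```

A production system would track many more metrics over time windows and pair such peer comparison with learned per-metric models (as the talk's VAE-based approach suggests), but the peer-deviation step captures why the check needs no labeled failure data.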