Saif Hasan, a software engineer at Meta, details the challenges and solutions involved in scaling large language model training to 100,000 GPUs. The talk addresses the growing size of LLMs and the corresponding demand for compute, noting that Llama 4 requires training on a 100,000-GPU cluster. A major problem is that job initialization time grows with scale, compounded by frequent hardware failures, which together can cut effective training time to roughly 50%. To combat this, Hasan outlines two key strategies: faster collective initialization through optimizations such as TCP store-based ring formation and bidirectional AllGathers, and fault-tolerant training. Fault-tolerant training isolates failures to subsets of GPUs so that healthy GPUs can keep training, raising effective training time to about 90%.
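The TCP store-based ring formation mentioned in the talk broadly corresponds to the rendezvous pattern available in open-source PyTorch, where all ranks exchange bootstrap information through a single key-value store rather than discovering each other pairwise. The sketch below is a minimal, illustrative example using the public torch.distributed TCPStore API; it is an assumption about the general technique, not Meta's internal implementation, and the bidirectional-AllGather optimization from the talk is not shown.

```python
# Minimal sketch: collective initialization via a shared TCP store,
# assuming the standard RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
# environment variables set by the launcher (e.g. torchrun).
import os
from datetime import timedelta

import torch.distributed as dist
from torch.distributed import TCPStore


def init_via_tcp_store() -> None:
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    host = os.environ["MASTER_ADDR"]
    port = int(os.environ["MASTER_PORT"])

    # Rank 0 hosts the store; all other ranks connect to it, so bootstrap
    # (ring-formation) information flows through one key-value service
    # instead of many pairwise connections at startup.
    store = TCPStore(
        host_name=host,
        port=port,
        world_size=world_size,
        is_master=(rank == 0),
        timeout=timedelta(minutes=5),
    )

    # NCCL process group built on top of the shared store.
    dist.init_process_group(
        backend="nccl",
        store=store,
        rank=rank,
        world_size=world_size,
    )


if __name__ == "__main__":
    init_via_tcp_store()
```

At 100,000-GPU scale, the talk's point is that even this bootstrap step becomes a bottleneck, which is why the described optimizations focus on how quickly ranks can form the communication ring and complete the initial AllGather.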