Ashmitha and Min, engineers at Meta, discuss optimizations to their communication and transport stack for high-performance Llama 4 training. They address the challenges of scaling model size and complexity, including expanding AI clusters to over 100k GPUs and introducing expert parallelism. They cover the network topology, highlighting the trade-offs between job scale and latency, and introduce Balanced Hierarchical Allocation to map parallelism dimensions onto network layers. They detail the evolution of their collective communications library, NCCLX, including a zero-copy architecture and a host-driven algorithm framework, as well as dynamic queue-pair (QP) load balancing to improve flow control and spread traffic more evenly across network paths. They conclude by discussing future directions, including resource-efficient communication, inference workload optimization, and support for heterogeneous and long-distance communication, emphasizing the importance of cross-layer co-design for large language model training and inference.
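To make the idea of mapping parallelisms to network layers concrete, below is a minimal Python sketch of the general principle: the most communication-intensive parallelism dimensions are placed on the innermost, lowest-latency network tier (within a host), while data parallelism is allowed to span the wider fabric. The dimension names, sizes, and tier sizes are assumptions chosen for illustration; this is not Meta's Balanced Hierarchical Allocation implementation.

```python
"""Toy sketch of mapping parallelism dimensions onto network tiers.

Illustrative assumption only: keep the chattiest parallelism groups on the
fastest, smallest network tier. Dimension names, sizes, and tier sizes are
hypothetical, not Meta's actual configuration.
"""

# Parallelism degrees, innermost (most communication-intensive) first.
DIMS = [("tp", 8), ("ep", 2), ("pp", 2), ("dp", 4)]

# Physical tiers, innermost first: GPUs per host, hosts per rack.
GPUS_PER_HOST = 8
HOSTS_PER_RACK = 4


def parallel_coords(rank: int) -> dict:
    """Innermost-first decomposition: tp varies fastest, dp slowest."""
    coords, r = {}, rank
    for name, size in DIMS:
        coords[name], r = r % size, r // size
    return coords


def physical_coords(rank: int) -> dict:
    """Where a linearly packed global rank lands in the physical topology."""
    host = rank // GPUS_PER_HOST
    return {"gpu": rank % GPUS_PER_HOST,
            "host": host % HOSTS_PER_RACK,
            "rack": host // HOSTS_PER_RACK}


if __name__ == "__main__":
    # tp peers of rank 0 (ranks 0..7) all share one host, so their collectives
    # stay on the fastest tier; its dp peers (ranks 0, 32, 64, 96) land in
    # different racks and cross the wider fabric.
    for rank in (0, 7, 8, 32, 96):
        print(rank, parallel_coords(rank), physical_coords(rank))
```

Because the decomposition is innermost-first, shrinking or growing the outer data-parallel dimension changes only which racks a job spans, not which GPUs form a tensor-parallel group, which is the kind of property a hierarchical allocation scheme aims to preserve.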