Ashmitha and Min, engineers at Meta, discuss optimizations to their communication and transport stack for high-performance Llama 4 training. They address the challenges of scaling model size and complexity, including expanding AI clusters to over 100k GPUs and introducing expert parallelism. They cover the network topology, highlighting the trade-offs between job scale and latency, and introduce Balanced Hierarchical Allocation to map parallelism dimensions onto network layers. They detail the evolution of their collective communications library, NCCLX, including a zero-copy architecture and a host-driven algorithm framework, as well as dynamic queue-pair (QP) load balancing to improve flow control and spread traffic more evenly across network paths. They conclude by discussing future directions, including resource-efficient communication, inference workload optimization, and support for heterogeneous and long-distance communication, emphasizing the importance of cross-layer co-design for large language model training and inference.
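To make the idea of mapping parallelisms to network layers concrete, below is a minimal Python sketch of the general principle: the most communication-intensive parallelism dimensions are placed on the innermost, lowest-latency network tier (within a host), while data parallelism is allowed to span the wider fabric. The dimension names, sizes, and tier sizes are assumptions chosen for illustration; this is not Meta's Balanced Hierarchical Allocation implementation.

```python
"""Toy sketch of mapping parallelism dimensions onto network tiers.

Illustrative assumption only: keep the chattiest parallelism groups on the
fastest, smallest network tier. Dimension names, sizes, and tier sizes are
hypothetical, not Meta's actual configuration.
"""

# Parallelism degrees, innermost (most communication-intensive) first.
DIMS = [("tp", 8), ("ep", 2), ("pp", 2), ("dp", 4)]

# Physical tiers, innermost first: GPUs per host, hosts per rack.
GPUS_PER_HOST = 8
HOSTS_PER_RACK = 4


def parallel_coords(rank: int) -> dict:
    """Innermost-first decomposition: tp varies fastest, dp slowest."""
    coords, r = {}, rank
    for name, size in DIMS:
        coords[name], r = r % size, r // size
    return coords


def physical_coords(rank: int) -> dict:
    """Where a linearly packed global rank lands in the physical topology."""
    host = rank // GPUS_PER_HOST
    return {"gpu": rank % GPUS_PER_HOST,
            "host": host % HOSTS_PER_RACK,
            "rack": host // HOSTS_PER_RACK}


if __name__ == "__main__":
    # tp peers of rank 0 (ranks 0..7) all share one host, so their collectives
    # stay on the fastest tier; its dp peers (ranks 0, 32, 64, 96) land in
    # different racks and cross the wider fabric.
    for rank in (0, 7, 8, 32, 96):
        print(rank, parallel_coords(rank), physical_coords(rank))
```

Because the decomposition is innermost-first, shrinking or growing the outer data-parallel dimension changes only which racks a job spans, not which GPUs form a tensor-parallel group, which is the kind of property a hierarchical allocation scheme aims to preserve.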