The lecture focuses on multi-machine optimization, i.e., parallelizing training across machines so that models too large for a single GPU can be trained. It covers networking basics, how hardware characteristics map onto parallelization strategies, and case studies, touching on compute and memory constraints, heterogeneous communication links, and the major parallelization paradigms. Collective communication operations such as all-reduce, broadcast, and all-gather are introduced, with emphasis on the equivalence between an all-reduce and a reduce-scatter followed by an all-gather. The lecture then explains data parallelism, model parallelism (pipeline and tensor), and activation parallelism, along with optimization techniques such as optimizer state sharding (ZeRO) and Fully Sharded Data Parallelism (FSDP). It concludes with examples of how these strategies are combined in large-scale distributed training runs, emphasizing the balance between memory, bandwidth, and batch size.
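As a minimal sketch of the all-reduce equivalence mentioned above (not code from the lecture), the snippet below implements a sum all-reduce as a reduce-scatter followed by an all-gather using PyTorch's torch.distributed. It assumes a process group has already been initialized (e.g., via torchrun) and that the tensor's element count is divisible by the world size; the helper name `all_reduce_via_reduce_scatter` is hypothetical.

```python
import torch
import torch.distributed as dist

def all_reduce_via_reduce_scatter(x: torch.Tensor) -> torch.Tensor:
    """Sum-all-reduce `x` across ranks as reduce-scatter + all-gather."""
    world_size = dist.get_world_size()
    flat = x.flatten().contiguous()

    # Reduce-scatter: each rank ends up owning one summed shard of the tensor.
    shard = torch.empty(flat.numel() // world_size, dtype=flat.dtype, device=flat.device)
    dist.reduce_scatter_tensor(shard, flat, op=dist.ReduceOp.SUM)

    # All-gather: collecting every rank's shard reconstructs the full summed tensor,
    # which is exactly what a single all-reduce would have produced.
    out = torch.empty_like(flat)
    dist.all_gather_into_tensor(out, shard)
    return out.view_as(x)
```

This decomposition is what makes sharded approaches like ZeRO/FSDP natural: the reduce-scatter leaves each rank holding only its shard of the summed gradients, and the all-gather can be deferred or replaced by gathering parameters on demand.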