In this video, Ahmed Taha explains how Fully Sharded Data Parallel (FSDP) works, beginning with definitions and a recap of Distributed Data Parallel (DDP). He details GPU memory usage during deep-network training, focusing on parameters, gradients, and optimizer states, particularly with the Adam optimizer and mixed precision (float16). Ahmed then covers NCCL, a library for multi-node communication, and contrasts FSDP with other model-parallelism techniques such as layer, tensor, and pipeline parallelism. He explains the four key parts of FSDP, illustrating each with examples: FSDP units, sharding, all-gather, and reduce-scatter. Finally, he discusses when to use FSDP versus DDP, highlighting the trade-off between GPU memory and time and the ease of switching from DDP to FSDP in PyTorch, and he covers reasons not to use FSDP, such as smaller models or workloads where network activations dominate memory consumption.
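As the video notes, switching from DDP to FSDP in PyTorch largely amounts to swapping the wrapper around the model. The sketch below is a minimal illustration of that, assuming a torchrun launch with the NCCL backend; the toy nn.Sequential model, its dimensions, and the hyperparameters are illustrative placeholders, not taken from the video.

```python
# Minimal sketch: DDP vs. FSDP wrapping under an assumed torchrun + NCCL setup.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; NCCL handles the GPU collectives.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model standing in for the real network.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()

    # DDP: every rank keeps a full copy of parameters, gradients, and optimizer state.
    # ddp_model = DDP(model, device_ids=[local_rank])

    # FSDP: parameters, gradients, and optimizer state are sharded across ranks.
    # Full parameters are materialized per FSDP unit via all-gather during
    # forward/backward, and gradients are reduce-scattered so each rank keeps
    # only its own shard.
    fsdp_model = FSDP(model)

    optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device="cuda")
    loss = fsdp_model(x).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In practice, the only line that changes between the two setups is the wrapper call; the trade-off discussed in the video is that FSDP's sharding lowers per-GPU memory at the cost of extra communication time per step.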