YouTube, 27 Aug 2023

How Fully Sharded Data Parallel (FSDP) works?

Ahmed Taha

In this video, Ahmed Taha explains how Fully Sharded Data Parallel (FSDP) works, beginning with definitions and a recap of Distributed Data Parallel (DDP). He breaks down GPU memory usage during deep network training into parameters, gradients, and optimizer states, particularly with the Adam optimizer and mixed precision (float16). He then covers NCCL, NVIDIA's library for collective communication across GPUs and nodes, and contrasts FSDP with other model parallelism techniques such as layer, tensor, and pipeline parallelism. He explains the four key parts of FSDP (FSDP units, sharding, all-gather, and reduce-scatter) and illustrates each with examples. Finally, he discusses when to use FSDP versus DDP, highlighting the trade-off between GPU memory and training time and how easy it is to switch from DDP to FSDP in PyTorch. He also covers reasons not to use FSDP, such as when the model is small or when network activations dominate memory consumption.
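The memory accounting for parameters, gradients, and optimizer states can be made concrete with a little arithmetic. The sketch below is not taken from the video. It assumes the commonly cited breakdown for mixed-precision Adam of 16 bytes per parameter (2 for the float16 parameter, 2 for the float16 gradient, and 12 for float32 optimizer state: a master copy of the weights, momentum, and variance), and it ignores activations, buffers, and the FSDP units that are temporarily gathered during forward and backward. The function name `training_state_bytes` and the 1-billion-parameter example are hypothetical.

```python
# Rough per-GPU memory for model state under mixed-precision Adam.
# Assumption (not from the video): 16 bytes per parameter in total.
FP16_PARAM_BYTES = 2        # float16 parameter
FP16_GRAD_BYTES = 2         # float16 gradient
FP32_OPTIM_BYTES = 4 * 3    # float32 master weights + momentum + variance


def training_state_bytes(num_params: int, num_gpus: int = 1, sharded: bool = False) -> int:
    """Approximate bytes per GPU for parameters, gradients, and Adam state.

    With sharded=False (DDP-style), every GPU holds a full replica.
    With sharded=True (FSDP-style full sharding), the state is split
    evenly across num_gpus; activations and gathered units are ignored.
    """
    bytes_per_param = FP16_PARAM_BYTES + FP16_GRAD_BYTES + FP32_OPTIM_BYTES
    total = num_params * bytes_per_param
    if not sharded:
        return total
    return -(-total // num_gpus)  # ceiling division


if __name__ == "__main__":
    params = 1_000_000_000  # hypothetical 1B-parameter model
    gpus = 8
    replicated = training_state_bytes(params, gpus, sharded=False)
    sharded = training_state_bytes(params, gpus, sharded=True)
    print(f"Replicated (DDP-style): {replicated / 1e9:.1f} GB per GPU")
    print(f"Sharded (FSDP-style):   {sharded / 1e9:.1f} GB per GPU")
```

Under these assumptions the sharded figure shrinks in proportion to the number of GPUs, which is the memory side of the trade-off. The time side, the extra all-gather and reduce-scatter communication, is not modeled here.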
