How Fully Sharded Data Parallel (FSDP) works?

In this video, Ahmed Taha explains how Fully Sharded Data Parallel (FSDP) works, beginning with definitions and a recap of Distributed Data Parallel (DDP). He breaks down GPU memory usage during deep network training into parameters, gradients, and optimizer states, focusing on the Adam optimizer with mixed precision (float16). He then covers NCCL, NVIDIA's library for multi-GPU and multi-node collective communication, and contrasts FSDP with model parallelism techniques such as layer, tensor, and pipeline parallelism. He walks through the four key parts of FSDP (FSDP units, sharding, all-gather, and reduce-scatter) and illustrates each with examples. Finally, he discusses when to use FSDP versus DDP: the trade-off between GPU memory and training time, how easy it is to switch from DDP to FSDP in PyTorch, and reasons not to use FSDP, such as small models or workloads where network activations dominate memory consumption.
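To make the memory breakdown above concrete, here is a minimal sketch of the bookkeeping. It assumes the commonly used mixed-precision Adam accounting (float16 parameters and gradients, plus float32 master weights, momentum, and variance as optimizer state), which may differ from the exact numbers used in the video. The function names are hypothetical, and activations and temporary buffers are deliberately ignored.

```python
BYTES_FP16 = 2
BYTES_FP32 = 4


def model_state_bytes(num_params: int) -> dict:
    """Rough bytes needed for parameters, gradients and Adam optimizer states.

    Assumed accounting (not taken from the video): fp16 parameters, fp16
    gradients, and three fp32 tensors per parameter for Adam (master copy
    of the weights, momentum, variance).
    """
    params = num_params * BYTES_FP16
    grads = num_params * BYTES_FP16
    optimizer = num_params * 3 * BYTES_FP32
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total": params + grads + optimizer,
    }


def per_gpu_bytes(num_params: int, num_gpus: int, sharded: bool) -> float:
    """Per-GPU model-state memory.

    DDP-style replication keeps a full copy on every GPU; FSDP-style full
    sharding splits the states evenly across GPUs. Activations and the
    temporarily gathered parameters of the active FSDP unit are ignored.
    """
    total = model_state_bytes(num_params)["total"]
    return total / num_gpus if sharded else float(total)


if __name__ == "__main__":
    n = 1_000_000_000  # a hypothetical 1B-parameter model
    gib = 1024 ** 3
    print(f"replicated: {per_gpu_bytes(n, 8, sharded=False) / gib:.1f} GiB per GPU")
    print(f"sharded:    {per_gpu_bytes(n, 8, sharded=True) / gib:.1f} GiB per GPU")
```

Under these assumptions the model states cost 16 bytes per parameter, so sharding across 8 GPUs cuts the per-GPU share by a factor of 8. The estimate says nothing about activation memory, which is one of the cases the summary gives for not using FSDP.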

Outlines

Ahmed Taha

Introduction to Fully Sharded Data Parallel (FSDP)

Distributed Data Parallel (DDP) and NCCL Library

DDP Recap and Introduction to Model Parallelism Techniques

Key Components of FSDP: Units, Sharding, and All Gather

All Gather Operation in FSDP

Reduce Scatter Operation and FSDP Training Process

When to Use and Not Use FSDP

00:00 Introduction to Fully Sharded Data Parallel (FSDP)

03:47 Distributed Data Parallel (DDP) and NCCL Library

07:36 DDP Recap and Introduction to Model Parallelism Techniques

11:27 Key Components of FSDP: Units, Sharding, and All Gather

17:31 All Gather Operation in FSDP

23:37 Reduce Scatter Operation and FSDP Training Process

27:05 When to Use and Not Use FSDP
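
The all-gather and reduce-scatter operations named in the chapters above can be illustrated with a small single-process simulation. This is only a sketch of what the collectives compute, using plain Python lists in place of GPU tensors. The function names are hypothetical and this is not how NCCL or PyTorch implement them. The reduction here is a plain sum; real training setups may additionally divide by the number of ranks to average gradients.

```python
from typing import List


def all_gather(shards: List[List[float]]) -> List[List[float]]:
    """Simulate all-gather: rank r starts with shards[r]; afterwards every
    rank holds the concatenation of all shards (the full parameter vector)."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def reduce_scatter(full_grads: List[List[float]]) -> List[List[float]]:
    """Simulate reduce-scatter: rank r starts with a full gradient vector
    full_grads[r]; afterwards rank r holds only the r-th chunk of the
    element-wise sum over all ranks."""
    world_size = len(full_grads)
    length = len(full_grads[0])
    if length % world_size != 0:
        raise ValueError("vector length must be divisible by the number of ranks")
    chunk = length // world_size
    summed = [sum(g[i] for g in full_grads) for i in range(length)]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]


if __name__ == "__main__":
    # Two ranks, each owning half of a four-element parameter vector.
    param_shards = [[1.0, 2.0], [3.0, 4.0]]
    print(all_gather(param_shards))

    # Each rank computed a full gradient on its own mini-batch.
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print(reduce_scatter(grads))
```

In this simulation, all-gather rebuilds the full parameters on every rank before computation, and reduce-scatter leaves each rank with only the reduced gradients for the shard it owns.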