This presentation explores a new checkpointing SSD designed to improve the training of Large Language Models (LLMs). The core idea is to offload the computationally light optimizer step of training onto the SSD itself. This relieves bandwidth pressure on the PCIe and network links between the GPU and storage, lowers GPU memory requirements (permitting larger models and bigger batch sizes), and shortens checkpoint-restore times. Open challenges remain in optimizing such storage for multi-tenant environments.
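To make the data flow concrete, here is a minimal sketch of the offloading idea, not the presentation's actual design: a hypothetical `SSDOptimizer` class stands in for a computational SSD that holds the master weights and Adam state on flash (emulated with memory-mapped files), so that only gradients cross the simulated PCIe link each step and weights are read back only on demand. All names, parameters, and the file layout are assumptions for illustration.

```python
# Sketch only: a plain-Python stand-in for an SSD that executes the
# optimizer step near the data. Memory-mapped files emulate flash-resident
# state; in a real device this update would run on the drive itself.
import numpy as np

class SSDOptimizer:
    """Hypothetical computational-SSD optimizer (Adam) for illustration.

    Weights and optimizer state (m, v) live on 'flash'; only gradients
    travel over the (simulated) PCIe link each training step.
    """
    def __init__(self, shape, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 path="ssd_state"):
        self.lr, self.b1, self.b2, self.eps = lr, *betas, eps
        self.t = 0
        # Memory-mapped files emulate SSD-resident tensors.
        self.w = np.memmap(f"{path}_w.bin", dtype=np.float32, mode="w+", shape=shape)
        self.m = np.memmap(f"{path}_m.bin", dtype=np.float32, mode="w+", shape=shape)
        self.v = np.memmap(f"{path}_v.bin", dtype=np.float32, mode="w+", shape=shape)

    def step(self, grad):
        """The 'offloaded' Adam update: runs where the state lives."""
        self.t += 1
        self.m[:] = self.b1 * self.m + (1 - self.b1) * grad
        self.v[:] = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        self.w[:] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

    def read_weights(self):
        """Pulled back over PCIe only when the GPU needs fresh weights."""
        return np.asarray(self.w)

# Usage: the host sends only gradients; optimizer state never occupies
# GPU memory and never crosses the link.
opt = SSDOptimizer(shape=(1024,))
for _ in range(3):
    grad = np.random.randn(1024).astype(np.float32)  # produced on the GPU
    opt.step(grad)                                    # executed on the SSD
fresh_weights = opt.read_weights()
```

Because the flash-resident weights are always up to date after each `step`, a checkpoint is effectively already on storage, which is one way the restore-time benefit described above could arise.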