YouTube, 18 Aug 2023
10m

W1 14 Computational challenges of training LLMs

AI Thought

This episode explores the memory challenges of training large language models (LLMs), in particular the frequent "out-of-memory" errors on NVIDIA GPUs. The speaker introduces CUDA and its role in accelerating deep learning operations, and notes that the sheer size of LLMs demands substantial GPU RAM. The core of the discussion is quantization as a memory-reduction technique: lowering the precision of model weights from 32-bit floating point (FP32) to 16-bit floating point (FP16) or 8-bit integers (INT8) cuts the memory per parameter from 4 bytes to 2 bytes or 1 byte. The speaker illustrates the trade-off between precision and memory by representing pi at each precision. The speaker then introduces bfloat16 (BF16), a 16-bit format that keeps FP32's dynamic range while giving up some precision, offering a balance between memory efficiency and training stability. The episode concludes that, although quantization significantly reduces memory needs, training extremely large LLMs often still requires distributed computing across multiple GPUs because of their massive parameter counts, which makes pre-training from scratch impractical for most users.
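The precision and memory trade-off described above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the function names are hypothetical, only the weights are counted (no gradients, optimizer states, or activations), the 1-billion-parameter model is just an example size, BF16 is emulated by truncating an FP32 value to its top 16 bits, and INT8 is shown as plain rounding without the scale factor a real quantization scheme would use.

```python
import math
import struct

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1}


def weights_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def to_fp32(x):
    """Round-trip a Python float through 32-bit floating point."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_fp16(x):
    """Round-trip a Python float through 16-bit (half) floating point."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: simple truncation; real hardware may round to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_int8(x):
    """Naive INT8 projection: round, then clamp to [-128, 127].

    Assumption: no scale factor or zero point is applied.
    """
    return max(-128, min(127, round(x)))


if __name__ == "__main__":
    num_params = 1_000_000_000  # illustrative model size
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weights_memory_gb(num_params, dtype):.1f} GB for weights")

    print("pi in FP32:", to_fp32(math.pi))
    print("pi in FP16:", to_fp16(math.pi))
    print("pi in BF16:", to_bf16(math.pi))
    print("pi in INT8:", to_int8(math.pi))
```

Under these assumptions, halving the bytes per parameter halves the weight memory, while the stored value of pi loses digits at each step down in precision.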
