Training LLMs at Scale - Deepak Narayanan | Stanford MLSys #83 | Stanford MLSys Seminars

This episode covers the challenges of training large language models at scale and the forms of parallelism used to address them. Deepak Narayanan, a senior applied research scientist at NVIDIA, discusses why efficient training requires careful consideration of the different parallelism dimensions along with domain-specific optimizations. He walks through the benefits and complexities of data parallelism, tensor model parallelism, and pipeline parallelism, the interactions between tensor and pipeline model parallelism, and the impact of communication patterns on training speed. The episode closes with a focus on optimizing throughput in distributed matrix multiplication and hints at future MLSys Seminars discussions on inference.

Outlines

00:03 Training Large Language Models at Scale: Challenges and Optimizations
04:58 Different Forms of Parallelism in Model Training
09:07 Exploring Pipeline Parallelism and Micro-batches in Model Execution
12:52 Understanding Pipeline Flush and Parallelism Modes in Training
16:17 Trade-off between Tensor and Pipeline Model Parallelism in Deep Learning
20:27 The Impact of Pipeline Parallelism on Throughput
27:12 Communication Patterns and Optimization for Efficient Scaling
34:54 Scaling Experiment and the Importance of Compute Optimal Models
40:08 The Importance of Hiding Communication in Computing Optimal Models
45:17 Design Strategies for Improving Throughput in Distributed Matrix Multiplication
50:30 Improving Training and Communication in Large-Scale Machine Learning Models
54:39 Deepak's Interest in Inference and Improving Throughput on Autoregressive Inference