
Inference efficiency and scaling in large language models depend heavily on the interplay between memory bandwidth, compute throughput, and batch size. Optimizing inference means balancing these hardware constraints, with batching serving as the critical mechanism for amortizing the cost of loading model weights from memory. As models scale, sparse mixture-of-experts architectures and pipeline parallelism become essential for managing memory capacity and communication bottlenecks within GPU racks. While memory bandwidth remains the primary constraint on long-context performance, the strategic use of memory tiers, from HBM to flash and disk, keeps token generation cost-effective. Ultimately, equating training and inference compute provides a heuristic for deciding how far to scale a model: the total volume of tokens generated at inference should roughly track the volume of pre-training data for the added scale to pay for itself.
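To make the batching argument concrete, the sketch below estimates decode throughput with a simple roofline model: each decode step must either stream all weights from HBM once (a cost shared across the batch) or perform roughly 2 FLOPs per parameter per generated token, whichever takes longer. The model size, bandwidth, and FLOP figures are illustrative assumptions rather than values from the article, and the sketch deliberately ignores KV-cache traffic and communication overheads.

```python
# Minimal roofline sketch of decode throughput, under assumed hardware numbers.

def decode_step_time(n_params: float, batch_size: int,
                     hbm_bandwidth: float, peak_flops: float,
                     bytes_per_param: float = 2.0) -> float:
    """Seconds per decode step: max of the memory-bound term (stream all
    weights once, amortized over the batch) and the compute-bound term
    (~2 FLOPs per parameter per generated token)."""
    memory_time = n_params * bytes_per_param / hbm_bandwidth
    compute_time = 2.0 * n_params * batch_size / peak_flops
    return max(memory_time, compute_time)

# Assumed figures: a 70B-parameter dense model in bf16 on an accelerator
# with ~3.3 TB/s of HBM bandwidth and ~1e15 FLOP/s of matmul throughput.
N, HBM, FLOPS = 70e9, 3.3e12, 1e15

for batch in (1, 8, 64, 256):
    t = decode_step_time(N, batch, HBM, FLOPS)
    memory_bound = (N * 2.0 / HBM) >= (2.0 * N * batch / FLOPS)
    bound = "memory-bound" if memory_bound else "compute-bound"
    print(f"batch={batch:4d}  {batch / t:10.0f} tok/s  ({bound})")

# The same accounting gives the training/inference parity heuristic:
# training costs ~6*N*D FLOPs for D pre-training tokens, serving costs
# ~2*N FLOPs per generated token, so inference compute reaches training
# compute after roughly 3*D generated tokens.
D = 15e12  # hypothetical pre-training token count
print(f"compute-parity inference tokens ~ {3 * D:.1e} for D = {D:.1e} training tokens")
```

Under these assumptions, throughput grows nearly linearly with batch size until the crossover batch of roughly peak FLOPs divided by bandwidth (here around 300), beyond which the step becomes compute-bound and further batching stops helping; the closing lines apply the same per-token FLOP accounting to the training-versus-inference parity heuristic described above.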