YouTube, 27 May 2026

Stanford CS153 Frontier Systems | The Discipline of Delivering Value per Gigawatt

Stanford Online

Scaling AI infrastructure requires moving beyond raw compute capacity toward maximizing goodput (the useful work a cluster actually completes) and value per dollar. Because training frontier models requires massive, synchronous clusters, system balance, meaning the ratio of HBM bandwidth, network throughput, and compute, becomes the primary technical challenge. Reliability is paramount: a single node failure can halt an entire training run, which forces a shift from traditional loose coupling to highly orchestrated, specialized hardware designs. Energy availability remains the most significant long-term bottleneck and calls for a portfolio approach that includes wind, solar, and innovative grid-integrated demand response. Ultimately, the future of infrastructure lies in specialized hardware, such as the divergence between training-optimized and inference-optimized chips, and in a commitment to making data centers community assets that provide grid stability rather than just consuming power.
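To make the reliability point concrete, here is a rough back-of-the-envelope sketch of how cluster size erodes goodput in a synchronous job. It is not taken from the episode. It assumes independent node failures with a constant rate, periodic checkpoints with no write cost, and a fixed restart time. The function names and all numbers are hypothetical.

```python
def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Mean time between job-stopping failures.

    Assumption: the job halts when ANY node fails, and node failures are
    independent with a constant rate, so the failure rates simply add up.
    """
    return node_mtbf_hours / num_nodes


def goodput_fraction(
    node_mtbf_hours: float,
    num_nodes: int,
    restart_hours: float,
    checkpoint_interval_hours: float,
) -> float:
    """First-order estimate of the share of wall-clock time spent on useful work.

    Assumption: each failure discards, on average, half a checkpoint interval
    of progress and then costs a fixed restart delay. Checkpoint write
    overhead is ignored.
    """
    mtbf = cluster_mtbf_hours(node_mtbf_hours, num_nodes)
    useful = mtbf - checkpoint_interval_hours / 2  # work kept per failure cycle
    wall_clock = mtbf + restart_hours              # total time per failure cycle
    return max(0.0, useful / wall_clock)


if __name__ == "__main__":
    # Hypothetical inputs: 50,000-hour node MTBF, hourly checkpoints,
    # 30-minute restarts.
    for nodes in (1_000, 10_000, 50_000):
        g = goodput_fraction(50_000, nodes, restart_hours=0.5,
                             checkpoint_interval_hours=1.0)
        print(f"{nodes:>6} nodes: goodput ~ {g:.2f}")
```

Under these assumptions, the interval between job-stopping failures shrinks in proportion to node count, which is why large synchronous clusters put so much weight on reliability and fast recovery.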
