YouTube, 23 Oct 2024
16m

Delivering Inference at Scale

Open Compute Project

This podcast explores the challenges and solutions for scaling AI inference. The speakers point out the crucial differences between AI training and inference, noting that inference demands low latency and cost-effectiveness, which often makes CPUs a better option than GPUs. They also tackle the increasing capacity crisis in data centers driven by rising AI workloads and power consumption. Advocating for a comprehensive approach to system design, they emphasize the importance of considering performance per rack and total cost of ownership at the data center level. The conversation includes examples of collaborative initiatives and practical solutions from Ampere, Supermicro, and other partners in the ARM ecosystem, highlighting efficient and budget-friendly inference options.
