30 Apr 2026
54m

How to Engineer AI Inference Systems with Philip Kiely - #766


The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Inference is the most critical and complex workload in the AI stack, and engineering it requires a multidisciplinary approach that blends GPU-level programming, distributed systems architecture, and the rapid application of emerging research. As AI models scale, the time from research result to production deployment has compressed to hours, which demands specialized expertise in optimizing performance, cost, and latency. Philip Kiely, head of AI education at Baseten, emphasizes that moving beyond generic, per-token APIs toward dedicated, workload-specific deployments lets companies achieve better performance and cost-efficiency. This shift involves mastering hardware-specific optimizations, such as quantization and KV cache management, to build robust agentic systems. Ultimately, the ability to control inference outcomes, rather than relying on opaque third-party providers, is becoming a primary competitive advantage for companies aiming to deliver high-performance, reliable AI products at scale.

