Inference Engineering 101 - with Philip Kiely | 0xSero

Inference engineering bridges the gap between raw AI model intelligence and production-ready performance. Philip Kiely, an Inference Engineer at Baseten, explains that efficient model deployment means navigating trade-offs between the compute-bound prefill phase and the memory-bound decode phase. Key optimization strategies include quantization, specifically moving to formats like NVFP4, and leveraging mixture-of-experts architectures to reduce hardware requirements. While large-scale data centers use disaggregated inference to maximize throughput, local self-hosting remains a distinct challenge involving hardware constraints and specialized parallelism. The conversation highlights how continuous advances in model architecture and hardware, such as NVIDIA’s Blackwell stack, let developers push performance limits. Ultimately, the field is shifting from academic benchmarks toward binary production-readiness, where specialized optimizations significantly reduce the cost and energy required to serve high-demand AI applications.

Outlines

00:00 The Evolution of Model Intelligence and Inference Engineering

06:16 Bridging Local and Data Center Inference

16:00 Technical Optimization Strategies for High-Performance Inference

25:15 Navigating Hardware Constraints and Disaggregated Inference

33:40 The Future of Local AI and Hardware Accessibility

45:00 Framework Comparisons and Troubleshooting Model Performance