22 Jan 2026
43m

Inferact: Building the Infrastructure That Runs Modern AI

AI + a16z

Simon Mo and Woosuk Kwon, co-founders of Inferact and creators of the open-source inference engine vLLM, discuss the challenges of running AI models, specifically inference. They highlight the shift from training smarter models to running them efficiently. Large language model requests are unpredictable: unlike traditional computing workloads, they vary widely in prompt length and carry real-time demands. vLLM addresses these challenges through innovations in scheduling and memory management, notably PagedAttention. Because vLLM is open source, it draws contributions from model providers, silicon vendors, and infrastructure providers, creating a collaborative ecosystem. The increasing scale and diversity of models, along with the rise of AI agents, further complicate inference and require continuous innovation and adaptation.

Outline

Part 1: Context, Origins of vLLM

Part 2: Community, Governance, Funding

Part 3: Technical Architecture, Scaling Challenges

Part 4: Open Source Strategy, Industry Adoption

Part 5: Inferact, Future Outlook
