Inferact: Building the Infrastructure That Runs Modern AI

Simon Mo and Woosuk Kwon, co-founders of Inferact and creators of the open-source inference engine vLLM, discuss the challenges of running AI models, specifically inference. They highlight the shift in focus from training smarter models to running them efficiently. Large language model requests are unpredictable in a way traditional computing workloads are not, because prompt lengths vary widely and responses are expected in real time. vLLM addresses these challenges through innovations in scheduling and memory management, most notably PagedAttention. Because vLLM is open source, it draws contributions from model providers, silicon vendors, and infrastructure providers, creating a collaborative ecosystem. The growing scale and diversity of models, along with the rise of AI agents, complicate inference further and demand continuous innovation and adaptation.
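PagedAttention borrows the idea of paging from operating systems. The KV cache is split into fixed-size blocks. Each request keeps a block table that maps its logical token positions to physical blocks, which are allocated on demand. This lets short and long prompts share one memory pool without reserving space for a maximum length.

The Python below is a minimal, hypothetical sketch of that bookkeeping only. `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are invented names, not vLLM's API. The block size is an arbitrary assumption, and no attention computation is modeled.

```python
from dataclasses import dataclass, field

# Hypothetical block size (tokens per KV-cache block), kept small for illustration.
BLOCK_SIZE = 4


class BlockAllocator:
    """Toy pool of fixed-size physical KV-cache blocks (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            # Assumption: a real scheduler might preempt or queue requests here.
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """A request whose KV cache lives in non-contiguous physical blocks."""

    num_tokens: int = 0
    # Logical block index -> physical block id, analogous to a page table.
    block_table: list[int] = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # Claim a new physical block only when the current last block is full,
        # so waste is bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, position: int) -> tuple[int, int]:
        # Translate a token position into (physical block id, offset in block).
        return self.block_table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def release(self, allocator: BlockAllocator) -> None:
        # Return every block to the shared pool when the request finishes.
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


if __name__ == "__main__":
    allocator = BlockAllocator(num_blocks=8)
    short_req, long_req = Sequence(), Sequence()
    for _ in range(3):   # 3-token prompt -> 1 block
        short_req.append_token(allocator)
    for _ in range(10):  # 10-token prompt -> 3 blocks
        long_req.append_token(allocator)
    print(len(short_req.block_table), len(long_req.block_table), len(allocator.free_blocks))
    print(long_req.physical_slot(9))
    short_req.release(allocator)
    print(len(allocator.free_blocks))
```

Running the script prints `1 3 4`, then `(4, 1)`, then `5`. The 3-token request occupies one block and the 10-token request three. Freeing the short request returns its block to the pool.

Unused space is limited to the unfilled tail of each request's last block. The real engine layers much more on top, such as sharing blocks between sequences and preemption policies, and this toy attempts none of it.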

Outline

Part 1: Context, Origins of vLLM

Part 2: Community, Governance, Funding

Part 3: Technical Architecture, Scaling Challenges

Part 4: Open Source Strategy, Industry Adoption

Part 5: Inferact, Future Outlook

AI + a16z

Part 1: Context, Origins of vLLM

00:00 The Growing Importance of AI Inference and Open Source Infrastructure

02:34 vLLM's Origins: Optimizing Inference for Open Source Language Models

05:18 Autoregressive Transformers: Unique Technical Challenges in AI Workloads

07:25 Dynamic Nature of LLMs: Scheduling and Memory Management Challenges

Part 2: Community, Governance, Funding

10:25 vLLM's Growing Community and the First Meetup at a16z

12:59 vLLM's Community: Diverse Contributors and Shared Incentives

14:38 Managing a Large Open Source Contributor Pool: Philosophy and Practices

17:25 Grant Funding and the Growing Costs of Open Source AI Development

Part 3: Technical Architecture, Scaling Challenges

19:53 Defining Inference Engines: Architecture and Key Components

22:20 Increasing Difficulty of Inference: Scale, Diversity, and Agents

24:34 Diversity in Models and Hardware: Expanding Inference Challenges

26:38 Implementing Diverse Models: Collaboration and Standardization Challenges

28:40 Agentic Workloads: New Challenges for Inference and State Management

Part 4: Open Source Strategy, Industry Adoption

30:37 Open Source AI: Promoting Diversity and Innovation

32:05 Vertical Stack Integration: Specializing Models for Specific Architectures

33:31 vLLM Deployments: Amazon's Rufus Assistant and Cutting-Edge Adoption

Part 5: Inferact, Future Outlook

35:11 Inferact's Mission: Universal Inference Layer and Open Source Commitment

36:23 Open Source as a Secret Weapon: Execution and Adoption Advantages

37:34 Ion Stoica's Role and Recruiting Talent for Inferact

39:06 Solving Inference at Scale: Hiring ML Infra Engineers

41:33 Building a Universal Inference Layer: Abstraction and Innovation

42:38 Closing Remarks and Disclaimers