Simon Mo and Woosuk Kwon, co-founders of Inferact and creators of the open-source inference engine vLLM, discuss the challenges of running AI models, specifically inference. They highlight the industry's shift from training smarter models to running them efficiently, and explain why large language model requests differ from traditional computing workloads: prompt lengths vary widely, output lengths are unpredictable, and responses must be generated in real time. vLLM addresses these challenges through innovations in scheduling and memory management, most notably PagedAttention. Because vLLM is open source, it draws contributions from model providers, silicon vendors, and infrastructure providers, creating a collaborative ecosystem. The growing scale and diversity of models, along with the rise of AI agents, further complicates inference, demanding continuous innovation and adaptation.
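To make the memory-management idea concrete, here is a minimal sketch of block-based KV-cache allocation in the spirit of PagedAttention: the cache is carved into fixed-size blocks, and each request maps its logical token positions to physical blocks through a per-sequence block table, much like virtual-memory paging. All names here (`BlockAllocator`, `Sequence`, `BLOCK_SIZE`) are illustrative assumptions, not vLLM's actual API.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool, so a request
    of unpredictable length never needs one large contiguous allocation."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; scheduler must preempt")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Maps a request's logical token positions to physical blocks via a
    per-sequence block table, analogous to a page table."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # Grab a new block only when the current one is full, so waste is
        # bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

# Two requests of very different lengths share one pool without fragmenting it.
pool = BlockAllocator(num_blocks=64)
short_req, long_req = Sequence(pool), Sequence(pool)
for _ in range(5):
    short_req.append_token()
for _ in range(100):
    long_req.append_token()
print(len(short_req.block_table), len(long_req.block_table))  # 1 7
```

The design choice this illustrates is the same one the episode highlights: because block tables decouple logical sequence length from physical layout, the scheduler can admit, grow, and preempt requests of wildly varying lengths without pre-reserving worst-case memory for each.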