In this episode of the Kubernetes Podcast from Google, hosts Kaslin Fields and Mofi Rahman interview Clayton Coleman and Rob Shaw about running large language models (LLMs) on Kubernetes. The discussion covers why LLMs present unique challenges compared to traditional web applications, focusing on resource usage, scale, and the need for specialized load balancing. The guests delve into the role of projects like the Inference Gateway and vLLM in optimizing LLM serving, and how llm-d aims to integrate these tools to create well-lit paths for production-grade inference applications. They also discuss the importance of open-source collaboration in this rapidly evolving field and speculate on the future of AI model serving on Kubernetes, highlighting the shift toward open innovation, hardware advancements, and the rise of agentic applications. The episode also touches on the Kubernetes 1.34 release, the KubeCrash event, and the CNCF's top open-source projects.
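For listeners unfamiliar with the serving stack the guests describe, here is a minimal sketch of calling a vLLM server through its OpenAI-compatible API; the endpoint URL, model name, and API key are placeholder assumptions for illustration, not details from the episode.

```python
# Minimal sketch: querying a vLLM server via its OpenAI-compatible API.
# Assumptions (not from the episode): a vLLM server is already running at
# http://localhost:8000 and serving a model registered as "my-model".
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint, so the standard client works;
# the api_key value is unused by a local server but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="my-model",
    messages=[
        {"role": "user", "content": "Why do LLMs need specialized load balancing?"}
    ],
)
print(response.choices[0].message.content)
```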