This episode explores the challenges and advances in deploying Large Language Models (LLMs) and AI agents at scale, particularly within LinkedIn's infrastructure. Against the backdrop of LinkedIn's significant investment in GPUs (a 7x increase in fleet size and a 150x increase in model training scale), the conversation highlights the rising cost of inference as a major hurdle. The discussion then turns to the complexities of applying LLMs to traditional machine learning (ML) tasks such as recommendation ranking (RecSys), questioning whether they are always the optimal choice and emphasizing the need for cost-effective, low-latency solutions. For instance, the use of LLMs in real-time feed generation is examined, weighing the benefits of improved personalization against the added computational expense. The development of Liger, an open-source GPU kernel library aimed at improving training efficiency and reducing training times, is presented as a key innovation for addressing these challenges. Finally, the episode touches on the evolving role of memory optimization and the need for more elastic, serverless architectures to maximize GPU utilization and minimize cost, concluding with a discussion of where the infrastructure needs of traditional ML and LLM-based applications converge and diverge.