In this episode of the Latent Space podcast, Alessio and Swyx interview Amir Haghighat and Yineng Zhang from Baseten about DeepSeek V3 and SGLang. The discussion covers the challenges of hosting large language models like DeepSeek V3, including hardware requirements (H200 GPUs, FP8 precision), the rise of fine-grained MoEs, and the growing adoption of FP8 training. They also discuss Baseten's approach to serving models on dedicated inference resources, Truss, its open-source library for model packaging and deployment, and the importance of transparent infrastructure. The conversation then explores SGLang's distinctive features, such as RadixAttention for KV cache reuse, constrained decoding with finite state machines (FSMs), and API speculative execution, before closing with the three pillars of running mission-critical inference workloads: model-level performance, horizontal scaling, and developer experience.