DeepSeek V3, SGLang, and the state of Open Model Inference in 2025 (Quantization, MoEs, Pricing)
Latent Space
In this episode of the Latent Space podcast, Alessio and Swyx interview Amir Haghighat and Yineng Zhang from Baseten about DeepSeek V3 and SGLang. The discussion covers the challenges of hosting large language models like DeepSeek V3, including hardware requirements (H200 GPUs, FP8 precision), the rise of fine-grained MoEs, and the increasing adoption of FP8 training. They also discuss Baseten's approach to serving models with dedicated inference resources, its open-source model packaging and deployment library Truss, and the importance of transparent infrastructure. The conversation further explores SGLang's distinctive features, such as RadixAttention for KV cache reuse, constrained decoding with finite state machines (FSMs), and API speculative execution, as well as the three pillars for running mission-critical inference workloads: model-level performance, horizontal scaling, and developer experience.
Part 1: Introduction, DeepSeek V3 Overview
Part 2: Market Trends, User Motivations
Part 3: Architecture, MoE, Training
Part 4: Infrastructure, Deployment, Truss Framework
Part 5: SGLang, Inference Engines
Part 6: Technical Optimizations, Structured Output
Part 7: Future Outlook, Conclusion
