YouTube · 19 Jan 2025
57m

DeepSeek V3, SGLang, and the state of Open Model Inference in 2025 (Quantization, MoEs, Pricing)

Latent Space

In this episode of the Latent Space podcast, Alessio and Swyx interview Amir Haghighat and Yineng Zhang from Baseten about DeepSeek V3 and SGLang. The discussion covers the challenges of hosting large language models like DeepSeek V3, including hardware requirements (H200 GPUs, FP8 precision), the rise of fine-grained MoEs, and the increasing adoption of FP8 training. They also discuss Baseten's approach to serving models with dedicated inference resources, its open-source model packaging and deployment library Truss, and the importance of transparent infrastructure. The conversation further explores SGLang's distinctive features, such as RadixAttention for KV cache reuse, constrained decoding with finite state machines (FSMs), and API speculative execution, as well as the three pillars for running mission-critical inference workloads: model-level performance, horizontal scaling, and developer experience.
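To make the hardware point concrete, here is a rough back-of-the-envelope sketch of why FP8 weights for a model of this size call for H200-class GPUs. It is not taken from the episode: the 671B total parameter count and the per-GPU memory sizes are public figures, the 8-GPU node size is an assumption, and the function names are hypothetical. It counts weights only and ignores KV cache, activations, and runtime overhead.

```python
# Weights-only memory estimate; KV cache and activations are NOT included.
PARAMS_BILLION = 671          # DeepSeek V3 total parameters (public figure, not from the summary)
BYTES_PER_PARAM_FP8 = 1       # FP8 stores one byte per parameter
GPUS_PER_NODE = 8             # assumed node size for this sketch

# HBM capacity per GPU in GB (public vendor specs)
GPU_MEMORY_GB = {"H100": 80, "H200": 141}


def weights_gb(params_billion: int, bytes_per_param: int) -> int:
    """Weight footprint in GB: 1e9 params * bytes per param / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def headroom_gb(gpu: str) -> int:
    """Memory left on one node after loading the FP8 weights (negative = does not fit)."""
    node_capacity = GPU_MEMORY_GB[gpu] * GPUS_PER_NODE
    return node_capacity - weights_gb(PARAMS_BILLION, BYTES_PER_PARAM_FP8)


if __name__ == "__main__":
    for gpu in GPU_MEMORY_GB:
        print(f"{gpu}: headroom {headroom_gb(gpu)} GB on an {GPUS_PER_NODE}-GPU node")
```

Under these assumptions, a single 8-GPU H100 node cannot hold the FP8 weights at all, while an 8-GPU H200 node leaves several hundred GB for KV cache and activations.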

Outlines

Part 1: Introduction, DeepSeek V3 Overview

Part 2: Market Trends, User Motivations

Part 3: Architecture, MoE, Training

Part 4: Infrastructure, Deployment, Truss Framework

Part 5: SGLang, Inference Engines

Part 6: Technical Optimizations, Structured Output

Part 7: Future Outlook, Conclusion
