DeepSeek V3, SGLang, and the state of Open Model Inference in 2025 (Quantization, MoEs, Pricing)

In this episode of the Latent Space podcast, Alessio and Swyx interview Amir Haghighat and Yineng Zhang from Baseten about DeepSeek V3 and SGLang. The discussion covers the challenges of hosting large language models like DeepSeek V3, including hardware requirements (H200 GPUs, FP8 precision), the rise of fine-grained MoEs, and the increasing adoption of FP8 training. They also discuss Baseten's approach to serving models with dedicated inference resources, their open-source model packaging and deployment library Truss, and the importance of transparent infrastructure. The conversation further explores SGLang's distinctive features, such as RadixAttention for KV cache optimization, constrained decoding with finite state machines (FSMs), and API speculative execution, as well as the three pillars for running mission-critical inference workloads: model-level performance, horizontal scaling, and developer experience.

Outlines

Part 1: Introduction, DeepSeek V3 Overview

Part 2: Market Trends, User Motivations

Part 3: Architecture, MoE, Training

Part 4: Infrastructure, Deployment, Truss Framework

Part 5: SGLang, Inference Engines

Part 6: Technical Optimizations, Structured Output

Part 7: Future Outlook, Conclusion

Latent Space

Part 1: Introduction, DeepSeek V3 Overview

00:04 Introduction to DeepSeek V3 and Baseten's Involvement

00:51 DeepSeek V3's Significance and the Challenges of Hosting It

03:06 Technical Hurdles and Model Size Comparisons

Part 2: Market Trends, User Motivations

04:44 Use Cases and Motivations for Using DeepSeek V3

06:45 Hardware Considerations and Baseten's Pricing Strategy

08:10 FP8 Training and Quantization Strategies

Part 3: Architecture, MoE, Training

10:35 FP8 Training Trends and Implementation Challenges

12:10 Fine-Grained MoE and Its Adoption

13:40 MoE Failures and DeepSeek's Competitive API Pricing

Part 4: Infrastructure, Deployment, Truss Framework

15:11 Baseten's Pricing Model and Customer Use Cases

16:31 Multi-Cloud Capabilities and Running DeepSeek V3

18:39 Truss Framework and Its Evolution

20:56 Design Decisions Behind the Truss Framework

23:31 Truss Chains and Transparency

Part 5: SGLang, Inference Engines

26:29 Framework Overview and SGLang's Strengths

28:11 The Three Pillars of Mission-Critical Inference Workloads

32:24 SGLang's Development and Unique Features

34:55 Reasons for Creating SGLang

Part 6: Technical Optimizations, Structured Output

36:30 Radix Cache and KV Cache Optimization

38:50 Constrained Decoding and Finite State Machines

40:33 XGrammar vs. Outlines for Structured Output

42:22 API Speculative Execution and SGLang's Growth

Part 7: Future Outlook, Conclusion

44:30 SGLang Roadmap and Speculative Decoding Techniques

46:32 RL Trainers and the Future of Fine-Tuning

49:50 Unasked Questions and Mission-Critical Inference Workloads

54:40 Concluding Remarks and Thank You