19 Jan 2025
1h 4m

Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)


Latent Space: The AI Engineer Podcast

This interview episode features Amir Haghighat and Yineng Zhang from Baseten, a leading large language model (LLM) inference platform, discussing the newly released DeepSeek v3 model. The discussion opens with an overview of DeepSeek v3's capabilities and its ranking on the LM Arena leaderboard, positioning it as the best open-weights model available. The conversation then turns to the challenges of serving such a large model, covering Baseten's use of H200 clusters and the role of inference frameworks like SGLang in making serving efficient. Finally, the speakers walk through the three pillars of mission-critical inference workloads—model-level performance, horizontal scalability, and developer experience—explaining Baseten's approach to each and the features that distinguish SGLang. One concrete takeaway is Baseten's consumption-based pricing model, which contrasts with per-token pricing and suits customers running custom models under strict performance requirements.
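As a concrete illustration of the serving setup discussed, here is a minimal sketch of launching DeepSeek v3 with SGLang's OpenAI-compatible server on a single multi-GPU node. This is an assumption based on SGLang's publicly documented CLI, not a command quoted from the episode; flags and hardware sizing (e.g. tensor parallelism degree) would vary by deployment.

```shell
# Sketch: serve DeepSeek-V3 with SGLang on one 8-GPU node (e.g. 8x H200).
# --tp 8 shards the model across all eight GPUs via tensor parallelism;
# --trust-remote-code is required for the model's custom architecture.
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code \
  --port 30000

# Once up, the server exposes an OpenAI-compatible endpoint, so any
# standard client can query it against http://localhost:30000/v1.
```

Horizontal scalability, the second pillar discussed, would then be handled above this layer by replicating such nodes behind a load balancer rather than inside a single server process.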