Ep. 017 - DeepSeek V4 and Huawei Ascend NPU Performance (InferenceX) | Kimbo Chen, Cam Quilici, Bryan Shan, Jordan Nanos | SemiAnalysis Weekly

DeepSeek V4’s transition to a 1-million-token context length relies on aggressive innovations in sparse attention and Mega MoE, which reduce KV cache memory requirements by approximately 100x compared to standard models. Achieving day-zero inference performance on new hardware requires complex engineering, specifically the fusion of communication and computation kernels to bypass traditional bottlenecks. While NVIDIA and AMD remain the primary targets for optimization, the emergence of Huawei’s Ascend NPU ecosystem highlights a shift toward more diverse hardware support, driven by rapid open-source contributions and sophisticated software toolkits like CANN. The ongoing competition between inference runtimes such as vLLM and SGLang further accelerates these performance gains, forcing continuous iteration on kernel libraries to maximize throughput and efficiency for large-scale model deployment.

Outlines



00:52 DeepSeek V4 Architecture and Performance Innovations

06:00 Mega Kernel and MoE Optimization Strategies

10:29 Iterative Performance Gains and Hardware Support

17:20 Competitive Dynamics of Open Source Inference Runtimes

25:34 Huawei Ascend NPU Integration and Market Impact

32:16 Future Directions in Agentic Benchmarking and RL Systems