In this monologue podcast, Sebastian Raschka gives a detailed comparison of large language model (LLM) architectures released in 2025, contrasting them with the original GPT architecture. He covers DeepSeek V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, Kimi K2, GPT-OSS, and Grok 2.5, focusing on architectural differences such as multi-head latent attention, mixture of experts, normalization layer placement, sliding window attention, and positional embeddings. He also touches on the trade-offs among model size, inference speed, memory usage, and training stability, and mentions his upcoming book on turning pre-trained models into reasoning models.