In this monologue podcast, Sebastian Raschka gives a detailed comparison of Large Language Model (LLM) architectures released in 2025, contrasting them with the original GPT architecture. He covers DeepSeek V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, Kimi K2, GPT-OSS, and Grok 2.5, focusing on architectural differences such as multi-head latent attention, mixture of experts, normalization layer placement, sliding window attention, and positional embeddings. He also touches on the trade-offs between model size, inference speed, memory usage, and training stability, and promotes his upcoming book on turning pre-trained models into reasoning models.
Outline
Part 1: Introduction and Memory Optimization
Part 2: Model Architectures: DeepSeek, OLMo
Part 3: Model Architectures: Gemma, Mistral
Part 4: Model Architectures: Llama, Qwen
Part 5: Model Architectures: Kimi, Grok
Part 6: Conclusion and Future Work
