In this monologue podcast, Sebastian Raschka gives a detailed comparison of large language model (LLM) architectures released in 2025, contrasting them with the original GPT architecture. He covers DeepSeek V3, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen3, SmolLM3, Kimi K2, GPT-OSS, and Grok 2.5, focusing on architectural differences such as multi-head latent attention, mixture of experts, normalization layer placement, sliding window attention, and positional embeddings. He also touches on the trade-offs among model size, inference speed, memory usage, and training stability, and mentions his upcoming book on turning pre-trained models into reasoning models.