YouTube, 04 Jun 2026

Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Stanford Online

Native multimodal language models extend the transformer architecture by tokenizing diverse inputs (images, audio, and video) into a single sequence for autoregressive generation. Current approaches inherit the scaling laws of large language models but struggle to balance image understanding against image generation. Chameleon and Transfusion show, respectively, that discretizing images into tokens and combining an autoregressive objective with diffusion can improve multimodal capabilities. The Mixture-of-Transformers (MoT) approach improves efficiency further by giving each modality its own parameters, which allows specialized processing without sacrificing text performance. Despite these advances, a fundamental gap remains in unifying perception, generation, and reasoning. Language still serves as the primary backbone for complex reasoning: current models struggle to achieve positive transfer from sensory data such as video to abstract cognitive tasks, which leaves physical-world intelligence an open research frontier.
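As a rough illustration of the unified-sequence idea described above, the sketch below packs text token IDs and discretized image codes into one token stream and then groups positions by modality, the way a model with modality-specific parameters might route them. It is not taken from the lecture or from any of the named papers: the vocabulary sizes, the `BOI`/`EOI` sentinel tokens, and the function names `build_sequence` and `route_by_modality` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical sizes, chosen only for this example.
TEXT_VOCAB_SIZE = 1000
IMAGE_CODEBOOK_SIZE = 512

# Hypothetical sentinel IDs placed after the text and image ranges.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # begin-of-image
EOI = BOI + 1                                # end-of-image

Segment = Tuple[str, List[int]]


def build_sequence(segments: List[Segment]) -> Tuple[List[int], List[str]]:
    """Flatten text and image segments into one token list plus modality tags.

    Image codes are shifted past the text vocabulary so both modalities
    share a single ID space. Sentinels are tagged as text (an assumption).
    """
    tokens: List[int] = []
    modalities: List[str] = []
    for kind, ids in segments:
        if kind == "text":
            if any(not 0 <= i < TEXT_VOCAB_SIZE for i in ids):
                raise ValueError("text token ID out of range")
            tokens.extend(ids)
            modalities.extend(["text"] * len(ids))
        elif kind == "image":
            if any(not 0 <= i < IMAGE_CODEBOOK_SIZE for i in ids):
                raise ValueError("image code out of range")
            tokens.append(BOI)
            modalities.append("text")
            tokens.extend(TEXT_VOCAB_SIZE + i for i in ids)
            modalities.extend(["image"] * len(ids))
            tokens.append(EOI)
            modalities.append("text")
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens, modalities


def route_by_modality(modalities: List[str]) -> Dict[str, List[int]]:
    """Group sequence positions by modality tag.

    A model with modality-specific parameters could use such groups to
    decide which set of weights processes each position.
    """
    groups: Dict[str, List[int]] = {}
    for position, modality in enumerate(modalities):
        groups.setdefault(modality, []).append(position)
    return groups


segments = [("text", [5, 6]), ("image", [0, 511]), ("text", [7])]
tokens, modalities = build_sequence(segments)
groups = route_by_modality(modalities)
```

Offsetting the image codes keeps a single embedding table and a single next-token objective over the whole sequence, while the modality tags preserve enough information to apply separate parameters per modality afterward.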
