Stanford CS25: Transformers United V6 I From Language Models to Native Multimodal Intelligence

Native multimodal language models extend transformer architectures by tokenizing diverse inputs (images, audio, and video) into a unified sequence for autoregressive generation. While current paradigms successfully leverage scaling laws from large language models, they face challenges in balancing image understanding and generation. Chameleon and Transfusion demonstrate, respectively, that discretizing images into tokens or combining an autoregressive objective with diffusion can enhance multimodal capabilities. The Mixture of Transformers (MoT) approach further improves efficiency by employing modality-specific parameters, allowing for specialized processing without sacrificing text performance. Despite these advancements, a fundamental gap remains in unifying perception, generation, and reasoning. Language continues to serve as the primary backbone for complex reasoning: current models struggle to achieve positive transfer from sensory data such as video to abstract cognitive tasks, leaving physical-world intelligence as an open research frontier.
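To make the idea of modality-specific parameters concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the MoT implementation: the `Token` class, the `"text"` and `"image"` tags, the two-dimensional toy features, and the weight values in `PARAMS` are all hypothetical. It shows only the routing step, where tokens stay in one interleaved sequence and each token is transformed by the weights that belong to its modality.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Token:
    modality: str          # hypothetical tag: "text" or "image"
    features: List[float]  # toy embedding, not a real model activation


def linear(weights: List[List[float]], x: List[float]) -> List[float]:
    """Multiply a weight matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def modality_specific_layer(
    sequence: List[Token], params: Dict[str, List[List[float]]]
) -> List[Token]:
    """Transform each token with the weights selected by its modality.

    The sequence keeps its original interleaved order; only the
    parameters applied to each position differ.
    """
    return [
        Token(tok.modality, linear(params[tok.modality], tok.features))
        for tok in sequence
    ]


# Assumed toy parameters: the text weights double the features,
# the image weights negate them, so the routing is easy to see.
PARAMS = {
    "text": [[2.0, 0.0], [0.0, 2.0]],
    "image": [[-1.0, 0.0], [0.0, -1.0]],
}

# One unified sequence that interleaves text and image tokens.
sequence = [
    Token("text", [1.0, 2.0]),
    Token("image", [3.0, 4.0]),
    Token("text", [5.0, 6.0]),
]

output = modality_specific_layer(sequence, PARAMS)
for tok in output:
    print(tok.modality, tok.features)
```

In this sketch the text and image weights never mix, which is one way to picture how specialized processing for images could be added without changing the parameters used for text.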

Outlines

Stanford Online

00:05 Transitioning from Large Language Models to Native Multimodal Intelligence

11:24 Comparing Chameleon and Transfusion Architectures for Multimodal Generation

20:22 Improving Multimodal Scaling with Mixture of Transformers Architecture

31:38 Advancements in Modality-Aware Architectures and Transfer Learning

48:03 Addressing Challenges in Spatial Reasoning and Autoregressive Modeling