YouTube, 05 Jul 2025
15m

Show-o2: Improved Native Unified Multimodal Models

Xiaol.x

In this episode of "The Deep Dive," the hosts dissect Show-o2, an AI model that aims to improve on native unified multimodal models by understanding and generating text, images, and videos within a single architecture. The discussion covers the model's four main contributions: combining autoregressive modeling with flow matching, unified visual representations for images and videos built on a 3D causal VAE and dual-path fusion, a two-stage training recipe designed to prevent catastrophic forgetting, and state-of-the-art performance on benchmarks. The hosts also touch on related work and walk through the model's structure, methodology, and training process. They review the empirical evidence for Show-o2's performance in multimodal understanding and generation, then turn to its limitations, such as text rendering and fine detail in generated images, and to its broader ethical impacts.

Outline

Part 1: Introduction to Unified Multimodal Models

Part 2: Show-o2 Architecture and Training

Part 3: Empirical Results and Ablation Studies

Part 4: Broader Impacts and Conclusion
