01 Jun 2026
1h 43m

Why Video Agent models are next — Ethan He, xAI Grok Imagine

Latent Space: The AI Engineer Podcast

Video generation models are evolving into interactive "world models" capable of real-time, long-horizon content generation. Building these systems requires overcoming massive data and compute bottlenecks, often by using synthetic language-to-video pairs and VAE-based latent space compression. While diffusion models handle pixel generation, the core intelligence increasingly stems from large language models that act as prompt rewriters and agentic harnesses. These agents manage long-context memory and tool calling, allowing for iterative refinement and the creation of generative user interfaces that respond directly to human intent. As inference costs decline, this technology promises to replace traditional, code-based interfaces with pixel-level, personalized experiences. Ethan He, formerly of xAI, argues that the future of this field lies in integrating these generative capabilities into autonomous agents that can program and refine their own outputs in real time.
