10 Dec 2024

7h 7m

Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1

Latent Space: The AI Engineer Podcast

In this episode of the Latent Space podcast, we recap highlights from the 2024 International Conference on Machine Learning (ICML), with a special focus on generative video models. We cover presentations on OpenAI's Sora, Google DeepMind's Genie, and VideoPoet, examining their strengths and weaknesses in producing high-quality, controllable video. The conversation also covers recent advances in diffusion models and their applications to other modalities, including audio and speech, and the difficulty of evaluating generative models. To wrap up, we discuss the intersection of large language models and computer vision, the importance of data and efficient training techniques in robotics and reinforcement learning, and the need for automated environment shaping.

Outlines

00:00 ICML 2024 Recap Introduction & NeurIPS 2024 Preview

02:04 Sora: OpenAI's First Video Generation Model

05:28 Sora's Technical Details: Unified Visual Representation & Diffusion Transformers

08:44 Sora's Capabilities: Generalization, Zero-Shot Editing, and Video Blending

12:40 Sora's Emerging Simulation Capabilities: 3D Consistency, Long-Range Coherence, and Object Permanence

17:51 Sora's Future: Digital World Simulation and Failure Cases

24:27 Sora Q&A: Movie Production, Training Data, Camera Control, and Audio

33:16 Sora Q&A (Continued): Image vs. Video Generation, Model Size, and Control

40:23 Sora Q&A (Conclusion): Agentic Behavior, Inductive Biases, and Real-World Applications

42:24 Genie: Google DeepMind's Generative Interactive Environments

44:47 Genie Oral Presentation: Motivation and Model Architecture

48:18 Genie Results: Environment Generation, Human Interaction, and Consistency

53:57 Genie: Future Directions and Q&A

59:59 Genie Poster Session: Origin Story and Applications

1:04:09 VideoPoet: A Large Language Model for Zero-Shot Video Generation

1:12:37 VideoPoet Oral Presentation: Approach and Results

1:24:23 VideoPoet Q&A: Instruction Following, Video Tokenization, and Open Sourcing

1:29:06 VideoPoet Poster Session: Origin Story and Future Directions

1:36:59 VideoPoet Q&A (Continued): Language Model vs. Diffusion Model Trade-offs and Future Work

1:41:20 Text, Camera, Action: The Future of Video Generation Beyond Data and Scale

1:47:03 Single Video Models and Layered Neural Atlases for Video Editing

1:52:18 Combining Single Video Models and Foundation Models for Enhanced Video Generation

1:56:08 Leveraging Text-to-Image Models for Video Synthesis: TokenFlow and SceneScape

2:02:06 Space-Time Features for Text-Driven Motion Transfer

2:12:42 Text-Driven Motion Transfer: Evaluation and Limitations

2:18:16 Text, Camera, Action Q&A: Open Source Models and Controllability

2:26:21 Diffusion Models: An Intuitive Geometric Perspective

2:35:36 Alternative Perspectives on Diffusion Models: Recurrent Networks and Spectral Analysis

2:42:29 Diffusion Guidance: A Cheat Code for Diffusion Models

2:54:40 Imagen 3 and Veo: DeepMind's Text-to-Image and Text-to-Video Models

2:58:02 Diffusion Models Q&A: Future Capabilities, Latent Diffusion, and Evaluation Metrics

3:07:08 Inferring 3D Structure with 2D Priors: Addressing the Challenges of 3D Generation

3:14:49 DreamFusion and ReconFusion: Text-to-3D and View Reconstruction using Score Distillation

3:22:29 CAT3D: Efficient Multi-View 3D Generation

3:28:35 CAT3D Q&A: Combining with Multi-Diffusion, Multi-View Approaches, and Artifacts

3:33:08 Flow Matching: A General Framework for Generative Modeling

3:40:22 Flow Matching in Non-Euclidean Spaces: Riemannian Manifolds and Material Generation

3:45:02 Discrete Flow Matching: Generative Modeling in Discrete Domains

3:51:14 Discrete Flow Matching: Special Cases and Applications