In this episode of the Latent Space podcast, we dive into the highlights from the 2024 International Conference on Machine Learning (ICML), with a special focus on generative video models. We explore presentations on OpenAI's Sora, Google DeepMind's Genie, and VideoPoet, examining their strengths and weaknesses in producing high-quality, controllable videos. The conversation also covers recent advances in diffusion models and their applications to other modalities, including audio and speech, along with the challenges of evaluating generative models. To wrap up, we discuss the intersection of large language models and computer vision, stressing the importance of data and efficient training techniques in robotics and reinforcement learning, along with the need for automatic environment shaping.
ICML 2024 Recap Introduction & NeurIPS 2024 Preview
Sora: OpenAI's First Video Generation Model
Sora's Technical Details: Unified Visual Representation & Diffusion Transformers
Sora's Capabilities: Generalization, Zero-Shot Editing, and Video Blending
Sora's Emerging Simulation Capabilities: 3D Consistency, Long-Range Coherence, and Object Permanence
Sora's Future: Digital World Simulation and Failure Cases
Sora Q&A: Movie Production, Training Data, Camera Control, and Audio
Sora Q&A (Continued): Image vs. Video Generation, Model Size, and Control
Sora Q&A (Conclusion): Agentic Behavior, Inductive Biases, and Real-World Applications
Genie: Google DeepMind's Generative Interactive Environments
Genie Oral Presentation: Motivation and Model Architecture
Genie Results: Environment Generation, Human Interaction, and Consistency
Genie: Future Directions and Q&A
Genie Poster Session: Origin Story and Applications
VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet Oral Presentation: Approach and Results
VideoPoet Q&A: Instruction Following, Video Tokenization, and Open Sourcing
VideoPoet Poster Session: Origin Story and Future Directions
VideoPoet Q&A (Continued): Language Model vs. Diffusion Model Trade-offs and Future Work
Text, Camera, Action: The Future of Video Generation Beyond Data and Scale
Single Video Models and Layered Neural Atlases for Video Editing
Combining Single Video Models and Foundation Models for Enhanced Video Generation
Leveraging Text-to-Image Models for Video Synthesis: TokenFlow and SceneScape
Space-Time Features for Text-Driven Motion Transfer
Text-Driven Motion Transfer: Evaluation and Limitations
Text, Camera, Action Q&A: Open Source Models and Controllability
Diffusion Models: An Intuitive Geometric Perspective
Alternative Perspectives on Diffusion Models: Recurrent Networks and Spectral Analysis
Diffusion Guidance: A Cheat Code for Diffusion Models
Imagen 3 and Veo: DeepMind's Text-to-Image and Text-to-Video Models
Diffusion Models Q&A: Future Capabilities, Latent Diffusion, and Evaluation Metrics
Inferring 3D Structure with 2D Priors: Addressing the Challenges of 3D Generation
DreamFusion and ReconFusion: Text-to-3D and View Reconstruction using Score Distillation
CAT3D: Efficient Multi-View 3D Generation
CAT3D Q&A: Combining with Multi-Diffusion, Multi-View Approaches, and Artifacts
Flow Matching: A General Framework for Generative Modeling
Flow Matching in Non-Euclidean Spaces: Riemannian Manifolds and Material Generation
Discrete Flow Matching: Generative Modeling in Discrete Domains
Discrete Flow Matching: Special Cases and Applications
Stable Diffusion 3: Scaling Rectified Flow Transformers
Stable Diffusion 3: Scaling Study and Results
Speech Synthesis with Diffusion Models: NaturalSpeech 3 and DiffS4L
DeCAF: A Retrospective on the Vision Foundation Model
DeCAF and the Pre-training Paradigm: Then and Now
Computer Vision in the Age of LLMs: Data, Models, and the Future
SigLIP and Beyond CLIP: Addressing Limitations of Contrastive Learning
PaLI: Multi-Stage Training for Vision Language Models
PaliGemma: An Open-Source Vision Language Model and RL Tuning
Learning Actions, Policies, and Rewards from Videos
ILPO: Inferring Latent Actions and Policies from Videos
Learning Value Functions and Inverse Dynamics from Suboptimal Video Demonstrations
Learning Rewards from Videos and Genie: Generative Interactive Environments
VQ-BeT: A Scalable Behavior Generation Model for Robotics
VQ-BeT: Data Collection, Performance, and Future Directions
Lessons on Robotics: Data Scarcity and the Need for New Learning Regimes
Addressing Data Scarcity in Robotics: Natural Supervision, External Data, and Test-Time Adaptation
YAY Robot: Using Language Feedback for Robotic Task Improvement
Clarify: Using Language Feedback for Image Classification Improvement
Leveraging Pre-trained Vision-Language Models for Robotic Control
Test-Time Adaptation in Robotics: Leveraging History for Improved Performance
Automatic Environment Shaping: The Next Frontier in Reinforcement Learning