Generally capable autonomous agents require embodiment to ground their knowledge in a rich, interactive world, moving beyond the limitations of text-only foundation models. By leveraging massive, open-ended environments such as Minecraft, researchers can train agents to solve complex, long-horizon tasks via natural language prompts and learned reward functions, reducing the need for hand-curated objectives. This evolution toward unified architectures—using multimodal prompting to bridge the "embodiment gap"—is essential for advancing robotics, a field that still lags behind NLP in standardized interfaces and benchmarks. As these systems move toward intuitive, goal-oriented interaction, reliance on manual prompt engineering should diminish, replaced by agents capable of collaborative, multi-step reasoning. Ultimately, the future of AI lies in developing agents that learn from diverse sensory inputs and human-like exploration, fundamentally transforming how machines interact with physical and digital environments.