Jim Fan, NVIDIA: Foundation models for embodied agents, scaling data, and why prompt engineering will become irrelevant
Generally Intelligent
Generally capable autonomous agents require embodiment to ground their knowledge in a rich, interactive world, which text-only foundation models cannot provide. By training in massive, open-ended environments like Minecraft, agents can learn to solve complex, long-horizon tasks from natural language prompts and learned reward functions, without hand-curated objectives. The move toward unified architectures that use multimodal prompting to bridge the "embodiment gap" is essential for advancing robotics, which currently lags behind NLP in standardization. As these systems move toward intuitive, goal-oriented interaction, reliance on manual prompt engineering will diminish, replaced by agents capable of collaborative, multi-step reasoning. Ultimately, the future of AI lies in agents that learn from diverse sensory inputs and human-like exploration, transforming how machines interact with physical and digital environments.
Part 1: Evolution, Digital Environments
Part 2: Benchmarks, Scaling Infrastructure
Part 3: Robotics, Embodiment Challenges
Part 4: Alignment, Research Strategy