What does the next training paradigm look like?

The current AI research paradigm relies on Reinforcement Learning with Verifiable Rewards (RLVR) to build general-purpose agents that can solve diverse tasks whose outcomes can be checked automatically. Scaling compute has historically driven progress, but current models face significant bottlenecks in sample efficiency and continual learning. Real-world environments are often non-stationary and lack deterministic simulators, so models struggle to learn from the sparse, unstructured data those environments provide. To overcome these limitations, future advances may depend on techniques like On-Policy Self-Distillation (OPSD) and "dreaming," where models generate their own simulated environments to rehearse skills. By distilling these experiences back into model weights, AI systems could evolve from static, pre-trained tools into agents that learn continuously through broad economic deployment, turning every user interaction into a source of intelligence rather than relying solely on pre-release training.

Outlines

Dwarkesh Patel

The RLVR Paradigm for Achieving AGI

The Bottleneck of Non-Replayable Environments in Computer Use

The Necessity of Continual Learning and Weight Updates

Advancing Continual Learning via OPSD and Dreaming

Future Trajectory of Autonomous AI Deployment

00:00 The RLVR Paradigm for Achieving AGI

02:12 The Bottleneck of Non-Replayable Environments in Computer Use

06:10 The Necessity of Continual Learning and Weight Updates

11:48 Advancing Continual Learning via OPSD and Dreaming

17:23 Future Trajectory of Autonomous AI Deployment