The next big breakthrough will be AIs learning on the job

The current AI research paradigm relies on Reinforcement Learning from Verifiable Rewards (RLVR) to build general-purpose agents, yet this approach faces significant hurdles in real-world domains where outcomes cannot be automatically verified, such as business strategy or politics. Because current models are highly sample-inefficient, they struggle to adapt to the unstructured, sparse data encountered during deployment. Overcoming this requires architectural innovations that enable continual learning, allowing models to update their weights based on on-the-job experience. Techniques like On-Policy Self-Distillation (OPSD) and "dreaming," in which models simulate environments to rehearse skills, offer promising paths to compress real-world insights into model parameters. Ultimately, the next generation of AI will likely improve not through static pre-training but through continuous, iterative learning from broad economic deployment, turning every user interaction into a mechanism for model refinement.

Outlines

Dwarkesh Podcast

00:00 Scaling RLVR and the Bottleneck of Deterministic Simulators

06:09 Generalization Challenges and the Inefficiency of Static Weights

11:48 Advancing Sample Efficiency Through Self-Distillation and Dreaming

17:23 Evolving AI Capabilities Through Continuous Real-World Deployment