This Nature podcast discusses the impending data shortage for AI development. A study projects that by 2028 AI models will have consumed all readily available online text data, a problem exacerbated by data providers restricting access to their content. The podcast explores potential solutions, including using proprietary data, focusing on specialized datasets, generating synthetic data, and improving model efficiency through techniques such as re-reading training data. While large language models (LLMs) may continue to grow, a shift towards smaller, task-specific models is anticipated. The episode also highlights the legal implications of training AI on copyrighted material, including ongoing lawsuits between news organizations and AI companies.