This Nature podcast discusses the impending data shortage for AI development. A study projects that by 2028 AI models will have consumed all readily available online text data, a problem exacerbated by data providers restricting access to their content. The podcast explores potential solutions, including using proprietary data, focusing on specialized datasets, generating synthetic data, and improving model efficiency through techniques such as re-reading training data. While large language models (LLMs) may continue to grow, a shift towards smaller, task-specific models is anticipated. The episode also highlights the legal implications of training AI on copyrighted material, including ongoing lawsuits between news organizations and AI companies.