03 Jul 2025
27m

Unlocking Unstructured Data with LLMs

The Data Exchange with Ben Lorica

This episode explores the challenges enterprises face in processing unstructured data and how large language models (LLMs) can address them. Shreya Shankar, a PhD student at UC Berkeley, introduces DocETL, a tool designed to extract semantic data from documents, aggregate it, and generate summaries. The discussion covers the limitations of bespoke NLP pipelines and crowdsourcing approaches, and how LLMs simplify thematic extraction and analysis. Shankar also discusses DocWrangler, an IDE for writing DocETL pipelines, and emphasizes the importance of UX in onboarding non-coders. The conversation also touches on balancing accuracy against non-determinism in LLM outputs and on using multiple LLMs to reach consensus.

Outline

Part 1: Context, Core Concepts

Part 2: Accessibility, Data Management

Part 3: Development, UX, Optimization

Part 4: Future, Ecosystem, Scaling
