Unlocking Unstructured Data with LLMs

This episode explores how enterprises can process unstructured data with large language models (LLMs), and the challenges they face in doing so. Shreya Shankar, a PhD student at UC Berkeley, introduces DocETL, a tool designed to extract semantic data from documents, aggregate it, and generate summaries. The discussion covers the limitations of bespoke NLP pipelines and crowdsourcing approaches, and highlights how LLMs simplify thematic extraction and analysis. Shankar also discusses DocWrangler, an IDE for writing DocETL pipelines, and emphasizes the importance of UX in onboarding non-coders. The conversation also touches on balancing accuracy against the non-determinism of LLM outputs and on using multiple LLMs to reach consensus.

Outline

Part 1: Context, Core Concepts

Part 2: Accessibility, Data Management

Part 3: Development, UX, Optimization

Part 4: Future, Ecosystem, Scaling

The Data Exchange with Ben Lorica

Part 1: Context, Core Concepts

00:03 Addressing Unstructured Data Challenges with Large Language Models in Enterprises

01:42 Extracting Structure from Unstructured Text Using LLMs for Thematic Analysis

03:54 DocETL: Map-Reduce Pipelines for Semantic Data Processing with LLMs

Part 2: Accessibility, Data Management

05:37 Democratizing Data Extraction: Empowering Non-Coders with LLMs

07:04 Managing Non-Determinism and Building Data Warehouses with DocETL

09:18 Leveraging Open-Source Libraries and Domain-Specific Languages with DocETL

11:44 Data Types, Formats, and the Starting Point for DocETL Pipelines

Part 3: Development, UX, Optimization

13:31 DocWrangler: An IDE for Writing and Evaluating DocETL Pipelines

15:31 UX Innovation for Semantic Data Processing: Aligning Users, Pipelines, and Data

17:59 Reasoning Models and Supervised Fine-Tuning for DocETL Pipelines

20:04 Consensus-Based Pipelines and Model Selection in DocETL

Part 4: Future, Ecosystem, Scaling

22:40 Multimodal Foundation Models and the Future of DocETL

23:54 Related Projects and Intellectual Differences in Semantic Data Processing

25:53 Scaling DocETL and the Cost of LLMs