Aparna Dhinakaran of Arize discusses LLM evaluations and observability, distinguishing model evals from task evals and focusing on the latter. She explains how LLMs serve as judges in real-world applications, using a chat-to-purchase e-commerce example built around a router and function calls. Aparna stresses the importance of running evals at different levels of an application, especially the router level, and demonstrates this with Arize's open-source product, Phoenix, tracing and evaluating the application's performance. She shares best practices, highlighting evals with explanations as key to effective iteration, and presents research findings on numeric versus categorical evals and on the impact of context-window placement in RAG applications, closing with an invitation to the Arize Observe event.
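To make the router-level idea concrete, here is a minimal sketch of an LLM-as-judge eval with explanations using Phoenix's evals module. The dataframe columns, judge template, and model choice are illustrative assumptions, not details taken from the talk; exact constructor arguments may vary by Phoenix version. Note that the judge is constrained to categorical labels rather than a numeric score, in line with the research findings Aparna presents.

```python
# A hedged sketch: judging router decisions with Phoenix's llm_classify,
# with explanations enabled so failures can be inspected and iterated on.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical traces exported from the application: one row per router call.
df = pd.DataFrame(
    {
        "question": ["Where is my order?", "Show me red sneakers under $50"],
        "route": ["order_status", "product_search"],
    }
)

# Illustrative judge prompt (assumed, not from the talk); placeholders map
# to the dataframe columns above.
ROUTER_EVAL_TEMPLATE = """You are evaluating the router of an e-commerce assistant.
User question: {question}
Route chosen: {route}
Did the router choose the correct route for this question?
Answer with exactly one word: correct or incorrect."""

evals_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=ROUTER_EVAL_TEMPLATE,
    rails=["correct", "incorrect"],  # categorical labels, not a numeric score
    provide_explanation=True,        # ask the judge to justify each label
)
print(evals_df[["label", "explanation"]])
```

The explanation column is what makes iteration practical: when the judge marks a route incorrect, its stated reasoning points to whether the router prompt, the function definitions, or the eval template itself needs fixing.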