Aparna Dhinakaran of Arize discusses LLM evaluations and observability, distinguishing model evals from task evals and focusing on the latter. She explains how LLMs serve as judges in real-world applications, using a chat-to-purchase e-commerce example built around a router and function calls. Aparna stresses the importance of running evals at different levels of an application, especially the router level, and demonstrates this with Arize's open-source product, Phoenix, tracing and evaluating the application's performance. She shares best practices, highlighting evals with explanations as key to effective iteration, and presents research findings on numeric versus categorical evals and on the impact of context-window placement in RAG applications, closing with an invitation to the Arize Observe event.
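To make the router-level idea concrete, here is a minimal sketch of an LLM-as-judge eval with explanations using Phoenix's evals module. The dataframe columns, judge template, and model choice are illustrative assumptions, not details taken from the talk; exact constructor arguments may vary by Phoenix version. Note that the judge is constrained to categorical labels rather than a numeric score, in line with the research findings Aparna presents.

```python
# A hedged sketch: judging router decisions with Phoenix's llm_classify,
# with explanations enabled so failures can be inspected and iterated on.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical traces exported from the application: one row per router call.
df = pd.DataFrame(
    {
        "question": ["Where is my order?", "Show me red sneakers under $50"],
        "route": ["order_status", "product_search"],
    }
)

# Illustrative judge prompt (assumed, not from the talk); placeholders map
# to the dataframe columns above.
ROUTER_EVAL_TEMPLATE = """You are evaluating the router of an e-commerce assistant.
User question: {question}
Route chosen: {route}
Did the router choose the correct route for this question?
Answer with exactly one word: correct or incorrect."""

evals_df = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o"),
    template=ROUTER_EVAL_TEMPLATE,
    rails=["correct", "incorrect"],  # categorical labels, not a numeric score
    provide_explanation=True,        # ask the judge to justify each label
)
print(evals_df[["label", "explanation"]])
```

The explanation column is what makes iteration practical: when the judge marks a route incorrect, its stated reasoning points to whether the router prompt, the function definitions, or the eval template itself needs fixing.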