In this interview, Hamel Husain discusses AI evaluations with Peter Yang, using NurtureBoss, an AI-powered property management assistant, as a practical example. Hamel emphasizes the importance of manually reviewing traces (records of user interactions) to identify specific product problems before applying automated evaluations. He introduces the concept of axial coding to categorize issues and suggests using LLMs to analyze and group notes from trace reviews. Hamel also cautions against relying solely on agreement scores for evaluating LLM judges, advocating for the use of true positive and true negative rates to measure the judge's accuracy. He recommends building a suite of evals over time and incorporating human labels regularly to ensure the AI system aligns with desired user experiences.
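To make the judge-evaluation point concrete, here is a minimal sketch (not from the interview; the function name and sample data are hypothetical) showing why true positive and true negative rates reveal problems that a single agreement score can hide:

```python
from typing import List

def judge_alignment(human_labels: List[bool], judge_labels: List[bool]) -> dict:
    """Compare an LLM judge's pass/fail verdicts against human labels.

    Reports the true positive rate (how often the judge confirms genuinely
    good responses) and the true negative rate (how often it catches real
    failures) alongside raw agreement.
    """
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    positives = sum(human_labels)
    negatives = len(human_labels) - positives

    return {
        "true_positive_rate": tp / positives if positives else float("nan"),
        "true_negative_rate": tn / negatives if negatives else float("nan"),
        "agreement": sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels),
    }

# Hypothetical example: 10 traces labeled by a human (True = good response)
# and by an LLM judge that always answers "good". Agreement looks decent
# (80%), yet the judge catches none of the failures (TNR = 0).
human = [True, True, True, True, True, True, True, True, False, False]
judge = [True] * 10
print(judge_alignment(human, judge))
```

Because failures are usually the minority class in trace reviews, tracking the true negative rate separately keeps a lenient judge from looking trustworthy on agreement alone.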