The podcast features two main speakers, with contributions from audience members, and focuses on the challenges and methodologies of evaluating AI systems, particularly Large Language Models (LLMs). The discussion covers why evals matter, the difficulty of defining metrics, the labor-intensive nature of the process, and the need for custom, subjective evaluations. Drawing on their experience at Google, the speakers emphasize benchmarking, establishing relevant metrics, and calibrating those metrics against human and user data. They introduce a scoring system that breaks complex evaluations down into simpler, more objective signals, and they describe a hands-on workshop where attendees build and refine their own scoring systems using a copilot tool, Google Sheets integration, and Python code in Colab.
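To make the scoring-system idea concrete, here is a minimal Python sketch (not taken from the episode) of decomposing a fuzzy "is this output good?" judgment into simple, objective signals that are combined into a weighted score; the signal names, weights, and checks are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Signal:
    """One simple, objective check on a model output, scored 0.0-1.0."""
    name: str
    weight: float
    check: Callable[[str], float]

def score(output: str, signals: List[Signal]) -> Dict[str, float]:
    """Run each signal on the output; return per-signal scores plus a weighted overall score."""
    results = {s.name: s.check(output) for s in signals}
    total_weight = sum(s.weight for s in signals)
    results["overall"] = sum(s.weight * results[s.name] for s in signals) / total_weight
    return results

# Hypothetical signals for a customer-support reply (assumed, not from the podcast).
signals = [
    Signal("mentions_refund_policy", 0.4, lambda out: 1.0 if "refund" in out.lower() else 0.0),
    Signal("under_100_words", 0.3, lambda out: 1.0 if len(out.split()) <= 100 else 0.0),
    Signal("no_hedging_phrases", 0.3, lambda out: 0.0 if "i think" in out.lower() else 1.0),
]

if __name__ == "__main__":
    reply = "You can request a refund within 30 days of purchase."
    print(score(reply, signals))
```

Each signal stays easy to calibrate against human or user judgments, which is the point of the decomposition: disagreements surface as specific failing signals rather than a single opaque score.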