Tejal, from OpenAI's reinforcement learning team, introduces the topic of evaluating AI models. She emphasizes the shift from traditional benchmarks to measuring performance on real-world tasks with GDPval, which assesses model capabilities across economically valuable sectors, and highlights the progress of models like GPT-5 toward parity with human experts on complex tasks.

Henry, from the product team, discusses why evaluations matter for builders of AI applications and introduces new features in OpenAI's evals product: datasets, traces, automated prompt optimization, third-party model support, and enterprise readiness. Using a demo with Carlyle, he illustrates how these tools can evaluate and improve AI agents, stressing integration, automation, ease of use, and starting evaluation early with real human data and subject-matter-expert annotations.