In this episode of the Super Data Science podcast, host Jon Krohn interviews AI expert Sinan Ozdemir about the limitations and pitfalls of current AI benchmarks. They discuss how AI labs often "teach to the test," inflating benchmark scores relative to real-world performance, and the difficulty of preventing test questions from leaking into training data. Ozdemir shares practical remedies such as creating custom test sets, implementing rubric-based evaluations, and maintaining internal leaderboards. The conversation also covers evaluating multimodal models and agentic systems, and using perplexity and confidence signals to gauge model reliability. Ozdemir recommends "AI Snake Oil" as a relevant book, and the two discuss his O'Reilly trainings and his podcast, "Practically Intelligent."
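As a rough illustration of the perplexity idea mentioned above (this is not code from the episode), here is a minimal Python sketch that computes perplexity from per-token log-probabilities, such as those some LLM APIs can return; the numeric values below are made up for demonstration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp(-mean(logprob)). Lower values mean the model found the
    sequence more predictable, a rough proxy for confidence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs for two generated answers
confident = [-0.05, -0.10, -0.02, -0.08]
uncertain = [-2.30, -1.90, -2.70, -2.10]

print(perplexity(confident))  # ~1.06 -> model was confident
print(perplexity(uncertain))  # ~9.5  -> model was uncertain
```

In an evaluation pipeline, a signal like this might be used to flag low-confidence outputs for human review rather than as a standalone quality score.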