Ido Pesok, an engineer at Vercel working on V0, introduces evals at the application layer and distinguishes them from model-layer evals. He uses a fruit letter counter app as an analogy to illustrate the unreliability of LLMs and the importance of building reliable AI applications. Pesok emphasizes understanding the "court," or boundaries, of your application's data: collect relevant user prompts and avoid out-of-bounds or overly concentrated datasets. He advises putting constants in the data and variables in the tasks, simplifying scores to make failures easier to debug, and adding evals to CI to track improvements and regressions, ultimately advocating for evals as a core component of improving app reliability and quality.
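The structure described above can be sketched in a few lines. This is a hypothetical illustration only, reusing the talk's fruit letter counter example: constants (inputs and expected outputs) live in the dataset, the task is the variable part, and a simple binary score is aggregated into one number that CI can track. The names (`EvalCase`, `countLetter`, `runEvals`) and the deterministic stand-in for the LLM call are assumptions, not code from the talk.

```typescript
// Sketch of an application-layer eval. All names are illustrative.

type EvalCase = { fruit: string; letter: string; expected: number };

// Constants belong in the data...
const dataset: EvalCase[] = [
  { fruit: "banana", letter: "a", expected: 3 },
  { fruit: "strawberry", letter: "r", expected: 3 },
  { fruit: "kiwi", letter: "z", expected: 0 },
];

// ...while the task is the variable part. A deterministic function
// stands in here for the real LLM-backed app call.
function countLetter(fruit: string, letter: string): number {
  return [...fruit].filter((c) => c === letter).length;
}

// A simple pass/fail score per case keeps failures easy to debug;
// the aggregate fraction is a single number to track in CI.
function runEvals(): number {
  let passed = 0;
  for (const c of dataset) {
    if (countLetter(c.fruit, c.letter) === c.expected) passed++;
  }
  return passed / dataset.length;
}

console.log(runEvals()); // 1 when every case passes
```

In a real app, `countLetter` would call the LLM, so the score would be noisy; running the suite in CI on every change is what surfaces regressions.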