This podcast episode explores projects and benchmarks that evaluate how well language model agents perform realistic web-based tasks. It discusses the challenges language models face with navigation, filtering, math, and social scenarios, highlighting the performance gap between current models and humans, and it underscores why evaluating models and understanding their strengths and weaknesses matters. The episode introduces several evaluation benchmarks, including WebArena, Sotopia, SWEBench, GAIA, and DynaBench, each targeting a different aspect of language model performance. The discussion also covers code generation, dataset contamination, dataset artifacts, benchmarks in the polymorphic era, and the concept of dynamic benchmarks. The episode concludes by exploring frameworks and tools such as Self-RAG, MetaGPT, and DSPy that aim to improve the performance, reliability, and versatility of language models.