Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith | Latent Space: The AI Engineer Podcast

This episode traces the evolution and business model of Artificial Analysis (AA), a company specializing in independent AI benchmarking. Founders Micah Hill-Smith and George Cameron describe AA's origins as a side project born from the need for objective model evaluation, and its transition to a sustainable business with over 20 employees. AA generates revenue through enterprise subscriptions and private benchmarking services, a model intended to keep its public benchmarks unbiased. The conversation covers the technical challenges of AI evaluation, including prompt engineering, parsing, and statistical rigor. The founders also discuss their AI Grant experience and their work on new evaluation metrics, such as the Omniscience Index for measuring hallucination. The episode closes with trends in AI, including the declining cost of intelligence, hardware efficiency, and the growing importance of token efficiency.

Outlines

Part 1: Origins, Business, and Mission

Part 2: Benchmarking Methodology and Challenges

Part 3: Intelligence Indices and Progress Tracking

Part 4: Model Size, Performance, and Agentic Tasks

Part 5: Tools, Agents, and Openness

Part 6: Economics and Efficiency of AI

Part 7: Future Outlook and V4 Roadmap

Part 1: Origins, Business, and Mission

00:06 Artificial Analysis' Origin Story and Business Model

01:23 Artificial Analysis's Enterprise Services: Benchmarking and Insight Subscriptions

04:09 The Impetus Behind Starting Artificial Analysis: A Developer's Need

07:09 The Problem with Existing Benchmarks and the Need for Independent Evaluation

Part 2: Benchmarking Methodology and Challenges

11:28 Nuances in Benchmarking: Variance, Confidence Intervals, and Special Endpoints

14:23 The Conceptual Shenanigans of Benchmarking: Targeting Measured Metrics

16:11 AI Grant Experience: Mentorship and Collaboration

18:00 AI Grant Companies as Power Users and the Evolution of Evals

Part 3: Intelligence Indices and Progress Tracking

20:46 Evolving the Intelligence Index: From V1 to V3

23:32 Visualizing AI Progress: The Untouchable Era of OpenAI

27:35 Introducing the Omniscience Index: Measuring Embedded Knowledge and Hallucination

29:20 Hallucination and Calibration: A Key Request

31:23 Hallucination Rates and Intelligence: No Strong Correlation

33:07 Hallucination as a Feature: Critical Point and the Need for Context

Part 4: Model Size, Performance, and Agentic Tasks

35:09 Building Evals Internally and Partnering with Academia and AI Companies

37:04 Estimating Model Size: Parameters and Inference Costs

39:38 GDPval-AA: A Fantastic Dataset for Broad White-Collar Work

41:24 Turning GDPval into an Eval: The Reference Agentic Harness

43:27 Elo vs. Percentage: Evaluating Document Outputs

45:04 Llama 4 Maverick and the Performance of Web Chatbots

Part 5: Tools, Agents, and Openness

47:10 The Power of Data Connections: Drafting Emails with LLMs

49:03 SuperBase and the Generalist Agentic Harness

50:39 Stirrup: A Minimalist Agentic Harness Released on GitHub

52:38 Introducing the Openness Index: Measuring Transparency in Model Development

54:40 The Trade-offs of Openness: Intelligence vs. Transparency

56:29 Licensing Worries and the Openness Index

Part 6: Economics and Efficiency of AI

58:26 The Falling Cost of Intelligence and the Smiling Curve

1:01:06 The Multipliers: Big Models, Reasoning, and Agentic Workflows

1:04:05 Hardware Efficiency and Sparsity

1:05:54 Reasoning vs. Non-Reasoning Models and Token Efficiency

1:07:52 Token Usage and Difficulty: The Behavior You Want in a Model

1:09:08 Token Efficiency vs. Number of Turns Efficiency

Part 7: Future Outlook and V4 Roadmap

1:10:13 Multi-Turn Benchmarks and Creative Direction

1:12:17 Infographics and Workhorse Use Cases

1:13:56 The Insatiable Demand for AI Intelligence

1:15:00 V4 of the Intelligence Index: GDPval, Critical Point, and Omniscience