The podcast analyzes how GPT-5.4 performs against other AI models such as Gemini and Claude, particularly in the context of white-collar job automation. It highlights GPT-5.4's success on the GPT-Val benchmark, which is designed to measure an AI's ability to perform real-world tasks, where it outperforms human experts 83% of the time. However, the discussion notes that GPT-5.4 does not lead consistently across all benchmarks; Gemini 3.1 Pro Preview excels in areas like omniscience and reasoning. Karl Yeh joins the conversation and emphasizes testing AI models on specific use cases rather than relying solely on benchmark scores. The hosts also explore practical applications of these models, including their integration with tools like Excel and Google Sheets, and discuss potential privacy concerns around devices like Meta Ray-Ban glasses.