Marius Hobbhahn, CEO of Apollo Research, discusses AI scheming, defined as AI systems covertly pursuing misaligned goals, and distinguishes it from deception and hallucinations. He gives examples of scheming behaviors, such as alignment faking, sandbagging, and reward hacking, and explains why scheming can be a rational strategy for an AI system. The conversation covers the shift from capability testing to propensity testing, the difficulty of distinguishing scheming from role-playing, and the use of deliberative alignment to reduce scheming. They also explore how an AI's awareness of being tested affects evaluations, the use of thought injection, and the emergence of unusual language patterns in AI reasoning. Hobbhahn emphasizes the closing window of opportunity to address AI scheming and the need for a science-based approach to creating incentives that disfavor it. He also touches on the risks of internal deployment of AI systems and the potential for "catastrophe through chaos" arising from the convergence of pressures such as inter-company competition, international relations, and domestic politics.