What Do Neural Networks Really Learn? Exploring the Brain of an AI Model

Mechanistic interpretability provides a framework for demystifying the "black box" nature of deep learning models. By analyzing individual neurons and their connections within convolutional neural networks, researchers can map how specific features—ranging from simple curves to complex objects like dog heads or car parts—are identified. This process involves feature visualization and circuit analysis to uncover the internal logic of models like InceptionV1. However, the phenomenon of polysemanticity, where a single neuron tracks multiple unrelated features, presents a significant challenge to clear interpretation. As automated systems increasingly influence high-stakes sectors such as healthcare and criminal justice, understanding the internal decision-making processes of these models is essential. This field of study functions like a natural science, relying on experimentation and hypothesis testing to gain confidence in how artificial intelligence processes information and reaches its conclusions.

Outlines

Sign in to continue reading, translating and more.

Open full episode in Podwise

Rational Animations

The Black Box Problem and Mechanics of Convolutional Neural Networks

Investigating Individual Neurons and Feature Visualization

Decoding Neural Circuits and Hierarchical Feature Detection

Polysemanticity and the Future of Mechanistic Interpretability

What Do Neural Networks Really Learn? Exploring the Brain of an AI Model

Rational Animations

00:00The Black Box Problem and Mechanics of Convolutional Neural Networks

The Black Box Problem and Mechanics of Convolutional Neural Networks

05:15Investigating Individual Neurons and Feature Visualization

Investigating Individual Neurons and Feature Visualization

09:52Decoding Neural Circuits and Hierarchical Feature Detection

Decoding Neural Circuits and Hierarchical Feature Detection

13:40Polysemanticity and the Future of Mechanistic Interpretability

Polysemanticity and the Future of Mechanistic Interpretability