Scaling interpretability

In this podcast, members of Anthropic's interpretability team discuss their latest work on understanding large language models (LLMs). They adapted dictionary learning, a technique previously applied to smaller models, to a full-scale LLM. The resulting features were unexpectedly rich: a "veganism" feature that activated across languages and formats, including both text and images, and a "backdoors in code" feature that also responded to images of hidden surveillance devices. The team emphasizes the engineering challenges of scaling the approach, such as handling petabyte-scale data, and the need for agile development and adaptable infrastructure to support research. Their findings suggest that even seemingly simple techniques, when implemented well at scale, can yield valuable insight into how LLMs work and can contribute to AI safety.

Outlines

Anthropic

Introductions and Project Overview