models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes, which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research.

Original text:

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.
