Paper | GitHub | Demo Notebook

This post is about our recent paper, Learning to Interpret Weight Differences in Language Models (Goel et al., Oct. 2025). We introduce a method for training a LoRA adapter that gives a finetuned model the ability to accurately describe the effects of its finetuning.

Figure 1: A demonstration of our method on Qwen3-8B. With the adapter applied, a model is able to answer questions about its finetuning changes. Try it yourself here.

WeightDiffQA

Our paper introduces and attempts to solve a task we call WeightDiffQA[1]:

Given a language model $M$, a weight diff $\delta$, and a natural language question $q$ about $\delta$, output a correct natural language answer to $q$.
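To make the inputs of the task concrete, here is a minimal Python sketch. It assumes that a weight diff is the elementwise difference between the finetuned and base parameters, which this section does not spell out. The names `Weights`, `weight_diff`, `apply_diff`, and `WeightDiffQAExample` are hypothetical and are not taken from the paper's code, and plain lists of floats stand in for real tensors.

```python
from dataclasses import dataclass
from typing import Dict, List

# Toy stand-in for model parameters: parameter name -> flattened values.
# (Assumption: a real implementation would use tensors.)
Weights = Dict[str, List[float]]


def weight_diff(base: Weights, finetuned: Weights) -> Weights:
    """Assumed definition: delta = finetuned - base, per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_diff(base: Weights, delta: Weights) -> Weights:
    """Recover the finetuned weights as base + delta."""
    return {
        name: [b + d for b, d in zip(base[name], delta[name])]
        for name in base
    }


@dataclass
class WeightDiffQAExample:
    """One hypothetical task instance: a diff, a question about it, and a reference answer."""
    delta: Weights
    question: str
    answer: str


# Tiny illustration with made-up numbers.
base = {"layer.weight": [1.0, 2.0]}
finetuned = {"layer.weight": [1.5, 2.0]}
delta = weight_diff(base, finetuned)
example = WeightDiffQAExample(
    delta=delta,
    question="What behavior did the finetuning introduce?",
    answer="(reference answer would go here)",
)
```

Under this reading, a system solving the task receives $M$, `example.delta`, and `example.question`, and is judged on whether its output matches `example.answer` in meaning.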