This lecture explores methods for interpreting neural networks, covering both convolutional networks and more modern "frontier" models. It opens with a case study that prompts listeners to brainstorm how to diagnose a large language model exhibiting problems with reasoning, safety, and latency. The lecture then turns to convolutional neural networks, detailing attribution techniques such as saliency maps, integrated gradients, and occlusion sensitivity that visualize how these models make decisions. A key method reverse-engineers a CNN with deconvolutional modules: a chosen activation is passed back through unpooling and transposed filtering to reconstruct the input pattern that produced it. For frontier models, the lecture covers analyzing attention patterns and embeddings, along with training diagnostics such as loss curves and scaling laws, to understand model behavior and performance. It also emphasizes data diagnostics, including distribution checks and contamination detection.
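As a minimal sketch of one of the attribution techniques named above, the following illustrates integrated gradients: attributions are computed by accumulating the model's gradient along a straight-line path from a baseline input to the actual input. The linear "model" and its weights here are hypothetical stand-ins, not from the lecture; real uses would pass a trained network and its autodiff gradient.

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=50):
    """Approximate integrated gradients with a midpoint Riemann sum
    along the straight path from `baseline` to `x`."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Hypothetical toy model: a fixed linear scorer f(x) = w . x
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(w @ x)
grad_f = lambda x: w  # gradient of a linear model is constant

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(grad_f, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline)
print(np.allclose(attr.sum(), f(x) - f(baseline)))  # True
```

For a linear model the attributions reduce exactly to `w * (x - baseline)`; for a deep network the same path integral apportions the change in output score across input features, which is what the saliency-style visualizations in the lecture display.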