koayon / atp_star

PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)

AtP*

Improved Attribution Patching for Localizing Large Model Behaviour

This repo contains code to run the AtP* algorithm for improved Attribution Patching. The code is based on the paper AtP*: An efficient and scalable method for localizing LLM behaviour to components (Kramár et al., 2024, DeepMind).

Attribution Patching (AtP) was introduced in Nanda (2022) as a fast approximation to the more precise Activation Patching (AcP), which measures the contribution of each model component to some metric (e.g. NLL loss, IOI score). AtP works by taking a first-order Taylor approximation of the contribution c(n), so the contributions of all components can be estimated from a single pair of forward passes plus one backward pass (see the sketch below).
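
Concretely, the AtP estimate for a node n is roughly ĉ_AtP(n) ≈ (n(x_noise) − n(x_clean))ᵀ ∇_n L(x_clean), i.e. the activation difference dotted with the gradient of the metric on the clean run (notation loosely follows the paper; the expectation over prompt pairs is omitted). Below is a minimal, self-contained PyTorch sketch of the idea on a toy one-component model; all names are illustrative and not taken from this repo, which uses nnsight for the actual caching and interventions.

```python
# Illustrative AtP sketch on a toy model (hypothetical names, not this repo's API).
import torch

torch.manual_seed(0)

# A single linear layer stands in for one model component (e.g. an attention head),
# and a nonlinear readout stands in for the rest of the model plus the metric L.
component = torch.nn.Linear(8, 8)
readout = torch.nn.Linear(8, 1)

def metric_from_activation(n: torch.Tensor) -> torch.Tensor:
    """Scalar metric L as a function of the component's activation n."""
    return readout(torch.tanh(n)).sum()

x_clean, x_noise = torch.randn(8), torch.randn(8)

# Cache the component's activation on the clean and noise prompts.
n_clean = component(x_clean)           # keeps grad_fn for the backward pass
n_noise = component(x_noise).detach()

# One backward pass gives dL/dn on the clean run.
n_clean.retain_grad()
metric_from_activation(n_clean).backward()

# AtP: first-order Taylor estimate of the effect of patching n_noise into the clean run.
c_atp = ((n_noise - n_clean.detach()) * n_clean.grad).sum()

# AcP ground truth: actually perform the patch and rerun the downstream computation.
c_acp = metric_from_activation(n_noise) - metric_from_activation(n_clean.detach())

print(f"AtP estimate: {c_atp.item():+.4f}   AcP ground truth: {c_acp.item():+.4f}")
```

In a real model the patch is applied per component and per token position; the point of AtP is that one clean forward-and-backward pass yields estimates for every node at once, whereas AcP needs a separate patched forward pass per node.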

Appreciation

Thanks to Jaden and the nnsight team for the nnsight package, which this repo uses for caching and interventions.

Thanks to Alice and the MechInterp Discord for the discussions and feedback.

Contributions

Contributions are welcome; please feel free to raise PRs to implement any additional features you're interested in! 😄

Progress

  • Implement AtP algorithm
  • Implement AtP with QK-Fix algorithm improvements
  • Implement full AtP* with GradDrop
  • Look at MLP component contributions
  • Optimise for GPU and apply fast K_fix method (Algorithm 4)
  • Conduct ablations and throughput experiments to reproduce paper results
  • Testing
  • Decouple from GPT-2
  • Add a complete circuit-finding algorithm which subsamples and sends the highest-ranked nodes to the slower AcP algorithm
  • Add subsampling for diagnostic bounds
