Awesome Interpretability in Large Language Models

The field of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository collects relevant resources to help beginners get started quickly and to help researchers keep up with the latest research progress.

This is an actively maintained repository; please open a new issue if I have missed any relevant resources. If you have any questions or suggestions, feel free to contact me via email: ruizhe.li@abdn.ac.uk.


Table of Contents


Awesome Interpretability Libraries

  • TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models (see the minimal usage sketch after this list). (Doc, Tutorial, Demo)
  • nnsight: enables interpreting and manipulating the internals of deep learning models. (Doc, Tutorial)
  • SAE Lens: train and analyse sparse autoencoders (SAEs). (Doc, Tutorial, Blog)
  • Automatic Circuit DisCovery: automatically builds circuits for mechanistic interpretability. (Paper, Demo)
  • Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. (Paper, Demo)
  • pyreft: A Powerful, Efficient and Interpretable fine-tuning method. (Paper, Demo)
  • repeng: A Python library for generating control vectors with representation engineering. (Paper, Blog)
  • Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Doc, Tutorial)
  • LXT (LRP eXplains Transformers): Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
  • Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
  • Inseq: PyTorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
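
As a quick illustration of what these libraries make possible, the sketch below uses TransformerLens to cache activations and intervene on an attention head. It is a minimal, hypothetical example (the model, prompt, and layer/head indices are arbitrary choices for demonstration, not taken from any paper listed here):

```python
# Minimal TransformerLens sketch: cache activations and ablate one attention head.
# Assumes `pip install transformer_lens`; model/layer/head choices are illustrative.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
logits, cache = model.run_with_cache(prompt)

# Greedy next-token prediction from the final position.
next_id = logits[0, -1].argmax().item()
print("next token:", model.tokenizer.decode([next_id]))

# Cached attention pattern for layer 0: shape [batch, n_heads, seq_len, seq_len].
print("layer-0 attention pattern:", cache["pattern", 0].shape)

# Simple intervention: zero-ablate head 7 in layer 9 during a forward pass.
def ablate_head(z, hook):
    # hook_z has shape [batch, seq_len, head_index, d_head]
    z[:, :, 7, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.9.attn.hook_z", ablate_head)],
)
print("logit shift:", (ablated_logits[0, -1, next_id] - logits[0, -1, next_id]).item())
```

Most of the other libraries above expose similar entry points (activation caching, hooks or interventions, SAE training and analysis); see their linked docs and tutorials for the equivalents.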

Awesome Interpretability Blogs & Videos

Awesome Interpretability Tutorials

Awesome Interpretability Forums

Awesome Interpretability Tools

Awesome Interpretability Programs

  • ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.

Awesome Interpretability Papers

Survey Papers

Title Venue Date Code
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP arXiv 2024-06-18 -
A Primer on the Inner Workings of Transformer-based Language Models arXiv 2024-05-02 -
Mechanistic Interpretability for AI Safety -- A Review arXiv 2024-04-22 -
From Understanding to Utilization: A Survey on Explainability for Large Language Models arXiv 2024-02-22 -
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks arXiv 2023-08-18 -

Position Papers

Title Venue Date Code
Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience ICML 2024-06-03 -
Interpretability Needs a New Paradigm arXiv 2024-05-08 -
Position Paper: Toward New Frameworks for Studying Model Representations arXiv 2024-02-06 -
Rethinking Interpretability in the Era of Large Language Models arXiv 2024-01-30 -

Interpretable Analysis of LLMs

Title Venue Date Code Blog
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation arXiv 2024-07-01 Github -
Recovering the Pre-Fine-Tuning Weights of Generative Models ICML 2024-07-01 Github Blog
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs arXiv 2024-06-28 Github Blog
Multi-property Steering of Large Language Models with Dynamic Activation Composition arXiv 2024-06-25 - -
Confidence Regulation Neurons in Language Models arXiv 2024-06-24 - -
Compact Proofs of Model Performance via Mechanistic Interpretability arXiv 2024-06-24 Github -
Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models arXiv 2024-06-23 - -
Estimating Knowledge in Large Language Models Without Generating a Single Token arXiv 2024-06-18 Github -
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations arXiv 2024-06-17 - -
Transcoders Find Interpretable LLM Feature Circuits arXiv 2024-06-17 Github -
Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue arXiv 2024-06-16 Github -
Context versus Prior Knowledge in Language Models ACL 2024-06-16 Github -
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models arXiv 2024-06-13 - -
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models ICML 2024-06-06 Github Blog
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals ACL 2024-06-06 Github -
Learned feature representations are biased by complexity, learning order, position, and more arXiv 2024-06-06 Demo -
Iteration Head: A Mechanistic Study of Chain-of-Thought arXiv 2024-06-05 - -
Activation Addition: Steering Language Models Without Optimization arXiv 2024-06-04 Code -
Interpretability Illusions in the Generalization of Simplified Models arXiv 2024-06-04 - -
SyntaxShap: Syntax-aware Explainability Method for Text Generation arXiv 2024-06-03 Github Blog
Calibrating Reasoning in Language Models with Internal Consistency arXiv 2024-05-29 - -
Black-Box Access is Insufficient for Rigorous AI Audits FAccT 2024-05-29 - -
Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting arXiv 2024-05-28 - -
From Neurons to Neutrons: A Case Study in Interpretability ICML 2024-05-27 Github -
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization arXiv 2024-05-27 Github -
Explorations of Self-Repair in Language Models ICML 2024-05-26 Github -
Emergence of a High-Dimensional Abstraction Phase in Language Transformers arXiv 2024-05-24 - -
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions arXiv 2024-05-23 Github -
Not All Language Model Features Are Linear arXiv 2024-05-23 Github -
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability arXiv 2024-05-20 - -
Your Transformer is Secretly Linear arXiv 2024-05-19 Github -
Are self-explanations from Large Language Models faithful? ACL 2024-05-16 Github -
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models arXiv 2024-05-14 - -
Steering Llama 2 via Contrastive Activation Addition arXiv 2024-05-07 Github -
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability AISTATS 2024-05-07 Github -
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning arXiv 2024-05-06 Github -
Circuit Component Reuse Across Tasks in Transformer Language Models ICLR 2024-05-06 Github -
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations HCI+NLP@NAACL 2024-04-24 Github -
How to use and interpret activation patching arXiv 2024-04-23 - -
Understanding Addition in Transformers arXiv 2024-04-23 - -
Towards Uncovering How Large Language Model Works: An Explainability Perspective arXiv 2024-04-15 - -
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation ICML 2024-04-10 Github -
Does Transformer Interpretability Transfer to RNNs? arXiv 2024-04-09 - -
Locating and Editing Factual Associations in Mamba arXiv 2024-04-04 Github Demo
Eliciting Latent Knowledge from Quirky Language Models ME-FoMo@ICLR 2024-04-03 - -
Do language models plan ahead for future tokens? arXiv 2024-04-01 - -
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models arXiv 2024-03-31 Github Demo
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms arXiv 2024-03-26 - -
Language Models Represent Space and Time ICLR 2024-03-04 Github -
AtP*: An efficient and scalable method for localizing LLM behaviour to components arXiv 2024-03-01 - -
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task arXiv 2024-02-28 - -
Function Vectors in Large Language Models ICLR 2024-02-25 Github Blog
A Language Model's Guide Through Latent Space arXiv 2024-02-22 - -
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model arXiv 2024-02-22 - -
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking ICLR 2024-02-22 Github Blog
Fine-grained Hallucination Detection and Editing for Language Models arXiv 2024-02-21 Github Blog
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation arXiv 2024-02-20 Github -
Identifying Semantic Induction Heads to Understand In-Context Learning arXiv 2024-02-20 - -
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space arXiv 2024-02-20 - -
Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models ACML 2024-02-12 - -
Model Editing with Canonical Examples arXiv 2024-02-09 Github -
Opening the AI black box: program synthesis via mechanistic interpretability arXiv 2024-02-07 Github -
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection ICLR 2024-02-06 - -
In-Context Language Learning: Architectures and Algorithms arXiv 2024-01-30 Github -
Gradient-Based Language Model Red Teaming EACL 2024-01-30 Github -
The Calibration Gap between Model and Human Confidence in Large Language Models arXiv 2024-01-24 - -
Universal Neurons in GPT2 Language Models arXiv 2024-01-22 Github -
The mechanistic basis of data dependence and abrupt learning in an in-context classification task ICLR 2024-01-16 - -
Overthinking the Truth: Understanding how Language Models Process False Demonstrations ICLR 2024-01-16 Github -
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks ICLR 2024-01-16 - -
Feature emergence via margin maximization: case studies in algebraic tasks ICLR 2024-01-16 - -
Successor Heads: Recurring, Interpretable Attention Heads In The Wild ICLR 2024-01-16 - -
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods ICLR 2024-01-16 - -
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity arXiv 2024-01-03 Github -
Forbidden Facts: An Investigation of Competing Objectives in Llama-2 ATTRIB@NeurIPS 2023-12-31 Github Blog
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets arXiv 2023-12-08 Github Blog
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching ATTRIB@NeurIPS 2023-12-06 Github -
Structured World Representations in Maze-Solving Transformers UniReps@NeurIPS 2023-12-05 Github -
Generating Interpretable Networks using Hypernetworks arXiv 2023-12-05 - -
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks NeurIPS 2023-11-21 Github -
Attribution Patching Outperforms Automated Circuit Discovery ATTRIB@NeurIPS 2023-11-20 Github -
Tracr: Compiled Transformers as a Laboratory for Interpretability NeurIPS 2023-11-03 Github -
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model NeurIPS 2023-11-02 Github -
Learning Transformer Programs NeurIPS 2023-10-31 Github -
Towards Automated Circuit Discovery for Mechanistic Interpretability NeurIPS 2023-10-28 Github -
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models EMNLP 2023-10-23 Github -
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023-10-20 Github -
Progress measures for grokking via mechanistic interpretability ICLR 2023-10-19 Github Blog
Copy Suppression: Comprehensively Understanding an Attention Head arXiv 2023-10-06 Github Blog & Demo
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models NeurIPS 2023-09-21 Github -
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca NeurIPS 2023-09-21 Github -
Emergent Linear Representations in World Models of Self-Supervised Sequence Models BlackboxNLP@EMNLP 2023-09-07 Github Blog
Finding Neurons in a Haystack: Case Studies with Sparse Probing arXiv 2023-06-02 Github -
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations ICML 2023-05-24 Github -
Localizing Model Behavior with Path Patching arXiv 2023-05-16 - -
Language models can explain neurons in language models OpenAI 2023-05-09 - -
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models ICLR Workshop 2023-04-22 - -
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small ICLR 2023-01-20 Github -
Interpreting Neural Networks through the Polytope Lens arXiv 2022-11-22 - -
Scaling Laws and Interpretability of Learning from Repeated Data arXiv 2022-05-21 - -
In-context Learning and Induction Heads Anthropic 2022-03-08 - -
A Mathematical Framework for Transformer Circuits Anthropic 2021-12-22 - -

SAE, Dictionary Learning and Superposition

Title Venue Date Code Blog
Interpreting Attention Layer Outputs with Sparse Autoencoders arXiv 2024-06-25 - Demo
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning arXiv 2024-05-24 Github -
Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis arXiv 2024-05-23 - -
Automatically Identifying Local and Global Circuits with Linear Computation Graphs arXiv 2024-05-22 - -
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet Anthropic 2024-05-21 - Demo
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models arXiv 2024-05-21 - -
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control arXiv 2024-05-20 - -
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks arXiv 2024-05-20 Github -
Improving Dictionary Learning with Gated Sparse Autoencoders arXiv 2024-04-30 - -
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers LessWrong 2024-04-29 - Demo
Activation Steering with SAEs LessWrong 2024-04-19 - -
SAE reconstruction errors are (empirically) pathological LessWrong 2024-03-29 - -
Sparse autoencoders find composed features in small toy models LessWrong 2024-03-14 Github -
Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT LessWrong 2024-03-05 Github -
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT arXiv 2024-02-19 - -
Do sparse autoencoders find "true features"? LessWrong 2024-02-12 - -
Toward A Mathematical Framework for Computation in Superposition LessWrong 2024-01-18 - -
Sparse Autoencoders Work on Attention Layer Outputs LessWrong 2024-01-16 - Demo
Sparse Autoencoders Find Highly Interpretable Features in Language Models ICLR 2024-01-16 Github -
Codebook Features: Sparse and Discrete Interpretability for Neural Networks arXiv 2023-10-26 Github Demo
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning Anthropic 2023-10-04 Github Demo-1, Demo-2, Tutorial
Polysemanticity and Capacity in Neural Networks arXiv 2023-07-12 - -
Distributed Representations: Composition & Superposition Anthropic 2023-05-04 - -
Superposition, Memorization, and Double Descent Anthropic 2023-01-05 - -
Engineering Monosemanticity in Toy Models arXiv 2022-11-16 Github -
Toy Models of Superposition Anthropic 2022-09-14 Github Demo
Softmax Linear Units Anthropic 2022-06-27 - -
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors DeeLIO@NAACL 2021-03-29 Github -
Zoom In: An Introduction to Circuits Distill 2020-03-10 - -

Interpretability in Vision LLMs

Title Venue Date Code Blog
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation arXiv 2024-06-24 Github -
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits XAI4CV@CVPR 2024-04-09 Github -
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) arXiv 2024-02-16 Github -
Analyzing Vision Transformers for Image Classification in Class Embedding Space NeurIPS 2023-09-21 Github -
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP CLVL@ICCV 2023-08-27 Github -
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models NeurIPS 2023-07-11 Github Blog

Benchmarking Interpretability

Title Venue Date Code Blog
Benchmarking Mental State Representations in Language Models MI@ICML 2024-06-25 - -
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains ACL 2024-05-21 Dataset Blog
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations arXiv 2024-02-27 Github -
CausalGym: Benchmarking causal interpretability methods on linguistic tasks arXiv 2024-02-19 Github -

Enhancing Interpretability

Title Venue Date Code Blog
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability arXiv 2024-01-08 - -
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability arXiv 2023-06-06 Github -

Others

Title Venue Date Code Blog
An introduction to graphical tensor notation for mechanistic interpretability arXiv 2024-02-02 - -
Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks arXiv 2023-10-03 Github -

Other Awesome Interpretability Resources


License: Creative Commons Zero v1.0 Universal