Awesome Interpretability in Large Language Models

The field of interpretability in large language models (LLMs) has been growing rapidly in recent years. This repository collects relevant resources to help beginners get started quickly and to help researchers keep up with the latest research progress.

This is an actively maintained repository; please open a new issue if I have missed any relevant resources. If you have any questions or suggestions, feel free to contact me via email: ruizhe.li@abdn.ac.uk.


Table of Contents


Awesome Interpretability Libraries

  • TransformerLens: A Library for Mechanistic Interpretability of Generative Language Models (see the minimal usage sketch after this list). (Doc, Tutorial, Demo)
  • nnsight: enables interpreting and manipulating the internals of deep learning models. (Doc, Tutorial)
  • SAE Lens: train and analyse sparse autoencoders (SAEs). (Doc, Tutorial, Blog)
  • Automatic Circuit DisCovery: automatically builds circuits for mechanistic interpretability. (Paper, Demo)
  • Pyvene: A Library for Understanding and Improving PyTorch Models via Interventions. (Paper, Demo)
  • pyreft: A Powerful, Efficient and Interpretable fine-tuning method. (Paper, Demo)
  • repeng: A Python library for generating control vectors with representation engineering. (Paper, Blog)
  • Penzai: a JAX library for writing models as legible, functional pytree data structures, along with tools for visualizing, modifying, and analyzing them. (Doc, Tutorial)
  • LXT (LRP eXplains Transformers): Layer-wise Relevance Propagation (LRP) extended to handle attention layers in Large Language Models (LLMs) and Vision Transformers (ViTs). (Paper, Doc)
  • Tuned Lens: Tools for understanding how transformer predictions are built layer-by-layer. (Paper, Doc)
  • Inseq: PyTorch-based toolkit for common post-hoc interpretability analyses of sequence generation models. (Paper, Doc)
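
As a quick illustration of what these libraries make possible, the sketch below uses TransformerLens to cache activations and intervene on an attention head. It is a minimal, hypothetical example (the model, prompt, and layer/head indices are arbitrary choices for demonstration, not taken from any paper listed here):

```python
# Minimal TransformerLens sketch: cache activations and ablate one attention head.
# Assumes `pip install transformer_lens`; model/layer/head choices are illustrative.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "The Eiffel Tower is located in the city of"
logits, cache = model.run_with_cache(prompt)

# Greedy next-token prediction from the final position.
next_id = logits[0, -1].argmax().item()
print("next token:", model.tokenizer.decode([next_id]))

# Cached attention pattern for layer 0: shape [batch, n_heads, seq_len, seq_len].
print("layer-0 attention pattern:", cache["pattern", 0].shape)

# Simple intervention: zero-ablate head 7 in layer 9 during a forward pass.
def ablate_head(z, hook):
    # hook_z has shape [batch, seq_len, head_index, d_head]
    z[:, :, 7, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.9.attn.hook_z", ablate_head)],
)
print("logit shift:", (ablated_logits[0, -1, next_id] - logits[0, -1, next_id]).item())
```

Most of the other libraries above expose similar entry points (activation caching, hooks or interventions, SAE training and analysis); see their linked docs and tutorials for the equivalents.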

Awesome Interpretability Blogs & Videos

Awesome Interpretability Tutorials

Awesome Interpretability Forums

Awesome Interpretability Tools

Awesome Interpretability Programs

  • ML Alignment & Theory Scholars (MATS): an independent research and educational seminar program that connects talented scholars with top mentors in the fields of AI alignment, interpretability, and governance.

Awesome Interpretability Papers

Survey Papers

Title Venue Date Code
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP arXiv 2024-06-18 -
A Primer on the Inner Workings of Transformer-based Language Models arXiv 2024-05-02 -
Mechanistic Interpretability for AI Safety -- A Review arXiv 2024-04-22 -
From Understanding to Utilization: A Survey on Explainability for Large Language Models arXiv 2024-02-22 -
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks arXiv 2023-08-18 -

Position Papers

Title Venue Date Code
Position Paper: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience ICML 2024-06-03 -
Interpretability Needs a New Paradigm arXiv 2024-05-08 -
Position Paper: Toward New Frameworks for Studying Model Representations arXiv 2024-02-06 -
Rethinking Interpretability in the Era of Large Language Models arXiv 2024-01-30 -

Interpretable Analysis of LLMs

Title Venue Date Code Blog
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation arXiv 2024-07-01 Github -
Recovering the Pre-Fine-Tuning Weights of Generative Models ICML 2024-07-01 Github Blog
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs arXiv 2024-06-28 Github Blog
Multi-property Steering of Large Language Models with Dynamic Activation Composition arXiv 2024-06-25 - -
Confidence Regulation Neurons in Language Models arXiv 2024-06-24 - -
Compact Proofs of Model Performance via Mechanistic Interpretability arXiv 2024-06-24 Github -
Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models arXiv 2024-06-23 - -
Estimating Knowledge in Large Language Models Without Generating a Single Token arXiv 2024-06-18 Github -
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations arXiv 2024-06-17 - -
Transcoders Find Interpretable LLM Feature Circuits arXiv 2024-06-17 Github -
Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue arXiv 2024-06-16 Github -
Context versus Prior Knowledge in Language Models ACL 2024-06-16 Github -
Talking Heads: Understanding Inter-layer Communication in Transformer Language Models arXiv 2024-06-13 - -
Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models ICML 2024-06-06 Github Blog
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals ACL 2024-06-06 Github -
Learned feature representations are biased by complexity, learning order, position, and more arXiv 2024-06-06 Demo -
Iteration Head: A Mechanistic Study of Chain-of-Thought arXiv 2024-06-05 - -
Activation Addition: Steering Language Models Without Optimization arXiv 2024-06-04 Code -
Interpretability Illusions in the Generalization of Simplified Models arXiv 2024-06-04 - -
SyntaxShap: Syntax-aware Explainability Method for Text Generation arXiv 2024-06-03 Github Blog
Calibrating Reasoning in Language Models with Internal Consistency arXiv 2024-05-29 - -
Black-Box Access is Insufficient for Rigorous AI Audits FAccT 2024-05-29 - -
Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting arXiv 2024-05-28 - -
From Neurons to Neutrons: A Case Study in Interpretability ICML 2024-05-27 Github -
Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization arXiv 2024-05-27 Github -
Explorations of Self-Repair in Language Models ICML 2024-05-26 Github -
Emergence of a High-Dimensional Abstraction Phase in Language Transformers arXiv 2024-05-24 - -
Anchored Answers: Unravelling Positional Bias in GPT-2's Multiple-Choice Questions arXiv 2024-05-23 Github -
Not All Language Model Features Are Linear arXiv 2024-05-23 Github -
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability arXiv 2024-05-20 - -
Your Transformer is Secretly Linear arXiv 2024-05-19 Github -
Are self-explanations from Large Language Models faithful? ACL 2024-05-16 Github -
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models arXiv 2024-05-14 - -
Steering Llama 2 via Contrastive Activation Addition arXiv 2024-05-07 Github -
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability AISTATS 2024-05-07 Github -
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning arXiv 2024-05-06 Github -
Circuit Component Reuse Across Tasks in Transformer Language Models ICLR 2024-05-06 Github -
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools and Self-Explanations HCI+NLP@NAACL 2024-04-24 Github -
How to use and interpret activation patching arXiv 2024-04-23 - -
Understanding Addition in Transformers arXiv 2024-04-23 - -
Towards Uncovering How Large Language Model Works: An Explainability Perspective arXiv 2024-04-15 - -
What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation ICML 2024-04-10 Github -
Does Transformer Interpretability Transfer to RNNs? arXiv 2024-04-09 - -
Locating and Editing Factual Associations in Mamba arXiv 2024-04-04 Github Demo
Eliciting Latent Knowledge from Quirky Language Models ME-FoMo@ICLR 2024-04-03 - -
Do language models plan ahead for future tokens? arXiv 2024-04-01 - -
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models arXiv 2024-03-31 Github Demo
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms arXiv 2024-03-26 - -
Language Models Represent Space and Time ICLR 2024-03-04 Github -
AtP*: An efficient and scalable method for localizing LLM behaviour to components arXiv 2024-03-01 - -
A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task arXiv 2024-02-28 - -
Function Vectors in Large Language Models ICLR 2024-02-25 Github Blog
A Language Model's Guide Through Latent Space arXiv 2024-02-22 - -
Interpreting Shared Circuits for Ordered Sequence Prediction in a Large Language Model arXiv 2024-02-22 - -
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking ICLR 2024-02-22 Github Blog
Fine-grained Hallucination Detection and Editing for Language Models arXiv 2024-02-21 Github Blog
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation arXiv 2024-02-20 Github -
Identifying Semantic Induction Heads to Understand In-Context Learning arXiv 2024-02-20 - -
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space arXiv 2024-02-20 - -
Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models ACML 2024-02-12 - -
Model Editing with Canonical Examples arXiv 2024-02-09 Github -
Opening the AI black box: program synthesis via mechanistic interpretability arXiv 2024-02-07 Github -
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection ICLR 2024-02-06 - -
In-Context Language Learning: Architectures and Algorithms arXiv 2024-01-30 Github -
Gradient-Based Language Model Red Teaming EACL 2024-01-30 Github -
The Calibration Gap between Model and Human Confidence in Large Language Models arXiv 2024-01-24 - -
Universal Neurons in GPT2 Language Models arXiv 2024-01-22 Github -
The mechanistic basis of data dependence and abrupt learning in an in-context classification task ICLR 2024-01-16 - -
Overthinking the Truth: Understanding how Language Models Process False Demonstrations ICLR 2024-01-16 Github -
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks ICLR 2024-01-16 - -
Feature emergence via margin maximization: case studies in algebraic tasks ICLR 2024-01-16 - -
Successor Heads: Recurring, Interpretable Attention Heads In The Wild ICLR 2024-01-16 - -
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods ICLR 2024-01-16 - -
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity arXiv 2024-01-03 Github -
Forbidden Facts: An Investigation of Competing Objectives in Llama-2 ATTRIB@NeurIPS 2023-12-31 Github Blog
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets arXiv 2023-12-08 Github Blog
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching ATTRIB@NeurIPS 2023-12-06 Github -
Structured World Representations in Maze-Solving Transformers UniReps@NeurIPS 2023-12-05 Github -
Generating Interpretable Networks using Hypernetworks arXiv 2023-12-05 - -
The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks NeurIPS 2023-11-21 Github -
Attribution Patching Outperforms Automated Circuit Discovery ATTRIB@NeurIPS 2023-11-20 Github -
Tracr: Compiled Transformers as a Laboratory for Interpretability NeurIPS 2023-11-03 Github -
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model NeurIPS 2023-11-02 Github -
Learning Transformer Programs NeurIPS 2023-10-31 Github -
Towards Automated Circuit Discovery for Mechanistic Interpretability NeurIPS 2023-10-28 Github -
Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models EMNLP 2023-10-23 Github -
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023-10-20 Github -
Progress measures for grokking via mechanistic interpretability ICLR 2023-10-19 Github Blog
Copy Suppression: Comprehensively Understanding an Attention Head arXiv 2023-10-06 Github Blog & Demo
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models NeurIPS 2023-09-21 Github -
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca NeurIPS 2023-09-21 Github -
Emergent Linear Representations in World Models of Self-Supervised Sequence Models BlackboxNLP@EMNLP 2023-09-07 Github Blog
Finding Neurons in a Haystack: Case Studies with Sparse Probing arXiv 2023-06-02 Github -
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations ICML 2023-05-24 Github -
Localizing Model Behavior with Path Patching arXiv 2023-05-16 - -
Language models can explain neurons in language models OpenAI 2023-05-09 - -
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models ICLR Workshop 2023-04-22 - -
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small ICLR 2023-01-20 Github -
Interpreting Neural Networks through the Polytope Lens arXiv 2022-11-22 - -
Scaling Laws and Interpretability of Learning from Repeated Data arXiv 2022-05-21 - -
In-context Learning and Induction Heads Anthropic 2022-03-08 - -
A Mathematical Framework for Transformer Circuits Anthropic 2021-12-22 - -

SAE, Dictionary Learning and Superposition

Title Venue Date Code Blog
Interpreting Attention Layer Outputs with Sparse Autoencoders arXiv 2024-06-25 - Demo
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning arXiv 2024-05-24 Github -
Improving Language Models Trained with Translated Data via Continual Pre-Training and Dictionary Learning Analysis arXiv 2024-05-23 - -
Automatically Identifying Local and Global Circuits with Linear Computation Graphs arXiv 2024-05-22 - -
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet Anthropic 2024-05-21 - Demo
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models arXiv 2024-05-21 - -
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control arXiv 2024-05-20 - -
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks arXiv 2024-05-20 Github -
Improving Dictionary Learning with Gated Sparse Autoencoders arXiv 2024-04-30 - -
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers LessWrong 2024-04-29 - Demo
Activation Steering with SAEs LessWrong 2024-04-19 - -
SAE reconstruction errors are (empirically) pathological LessWrong 2024-03-29 - -
Sparse autoencoders find composed features in small toy models LessWrong 2024-03-14 Github -
Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT LessWrong 2024-03-05 Github -
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT arXiv 2024-02-19 - -
Do sparse autoencoders find "true features"? LessWrong 2024-02-12 - -
Toward A Mathematical Framework for Computation in Superposition LessWrong 2024-01-18 - -
Sparse Autoencoders Work on Attention Layer Outputs LessWrong 2024-01-16 - Demo
Sparse Autoencoders Find Highly Interpretable Features in Language Models ICLR 2024-01-16 Github -
Codebook Features: Sparse and Discrete Interpretability for Neural Networks arXiv 2023-10-26 Github Demo
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning Anthropic 2023-10-04 Github Demo-1, Demo-2, Tutorial
Polysemanticity and Capacity in Neural Networks arXiv 2023-07-12 - -
Distributed Representations: Composition & Superposition Anthropic 2023-05-04 - -
Superposition, Memorization, and Double Descent Anthropic 2023-01-05 - -
Engineering Monosemanticity in Toy Models arXiv 2022-11-16 Github -
Toy Models of Superposition Anthropic 2022-09-14 Github Demo
Softmax Linear Units Anthropic 2022-06-27 - -
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors DeeLIO@NAACL 2021-03-29 Github -
Zoom In: An Introduction to Circuits Distill 2020-03-10 - -

Interpretability in Vision LLMs

Title Venue Date Code Blog
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation arXiv 2024-06-24 Github -
PURE: Turning Polysemantic Neurons Into Pure Features by Identifying Relevant Circuits XAI4CV@CVPR 2024-04-09 Github -
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE) arXiv 2024-02-16 Github -
Analyzing Vision Transformers for Image Classification in Class Embedding Space NeurIPS 2023-09-21 Github -
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP CLVL@ICCV 2023-08-27 Github -
Scale Alone Does not Improve Mechanistic Interpretability in Vision Models NeurIPS 2023-07-11 Github Blog

Benchmarking Interpretability

Title Venue Date Code Blog
Benchmarking Mental State Representations in Language Models MI@ICML 2024-06-25 - -
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains ACL 2024-05-21 Dataset Blog
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations arXiv 2024-02-27 Github -
CausalGym: Benchmarking causal interpretability methods on linguistic tasks arXiv 2024-02-19 Github -

Enhancing Interpretability

Title Venue Date Code Blog
Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability arXiv 2024-01-08 - -
Seeing is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability arXiv 2023-06-06 Github -

Others

Title Venue Date Code Blog
An introduction to graphical tensor notation for mechanistic interpretability arXiv 2024-02-02 - -
Episodic Memory Theory for the Mechanistic Interpretation of Recurrent Neural Networks arXiv 2023-10-03 Github -

Other Awesome Interpretability Resources


License: Creative Commons Zero v1.0 Universal