There are 28 repositories under the mechanistic-interpretability topic.
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Mechanistically interpretable neurosymbolic AI (Nature Computational Science, 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out of distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
🧠 Starter templates for doing interpretability research
Sparse and discrete interpretability tool for neural networks
Full code for the sparse probing paper.
Explain a black-box module in natural language.
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Steering vectors for transformer language models in PyTorch / Hugging Face
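As a hedged illustration of the technique this repo implements (not its actual API), a steering vector is an activation-space direction added to a layer's output at inference time. A minimal sketch with a plain PyTorch forward hook, using a toy linear layer in place of a real transformer block:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer standing in for a transformer block's output projection.
layer = nn.Linear(8, 8)

# Hypothetical steering vector; in practice it is derived from the model's
# own activations (e.g. the difference between two contrastive prompts).
steering_vector = torch.randn(8)

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + steering_vector

x = torch.zeros(1, 8)

handle = layer.register_forward_hook(add_steering)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The steered output differs from the unsteered one by exactly the vector.
assert torch.allclose(steered - unsteered, steering_vector)
```

In a real language model the hook would be registered on a chosen decoder layer, and the vector scaled to control steering strength.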
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. Open-sourced and constantly updated.
🦠 DeepDecipher: An open source API to MLP neurons
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
graphpatch is a library for activation patching on PyTorch neural network models.
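As a hedged sketch of activation patching in general (plain PyTorch hooks, not graphpatch's actual interface): cache an activation from a clean run, then overwrite the same activation during a corrupted run to measure how much that site restores the clean behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model standing in for a transformer.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))

clean_input = torch.ones(1, 4)
corrupt_input = -torch.ones(1, 4)

# 1. Cache the first layer's activation on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

h = model[0].register_forward_hook(save_hook)
clean_out = model(clean_input)
h.remove()

# 2. Run the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # replace the corrupted activation wholesale

h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
h.remove()

# Everything downstream of the patch only sees the clean activation,
# so the patched run reproduces the clean output exactly.
assert torch.allclose(patched_out, clean_out)
```

Real patching experiments patch narrower slices (one head, one token position) and compare a behavioral metric across runs rather than raw outputs.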
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Interpretability on 1-layer Transformer models that converge on the Bayesian-optimal solution for statistical tasks
Starting Kit for the CodaBench competition on Transformer Interpretability
Exploring length generalization in the context of the indirect object identification (IOI) task for mechanistic interpretability.
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
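The setup replicated here can be sketched in a few lines (a minimal sketch under assumed hyperparameters, not the repo's code): sparse features are squeezed through a smaller hidden dimension and reconstructed with tied weights and a ReLU, and sparsity lets the model store more features than dimensions in superposition.

```python
import torch

torch.manual_seed(0)

# n sparse features, bottlenecked through a smaller hidden dimension.
n_features, d_hidden = 5, 2
W = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

def batch(size=256, p=0.05):
    # Each feature fires with probability p, magnitude uniform in [0, 1].
    x = torch.rand(size, n_features)
    return x * (torch.rand(size, n_features) < p)

x0 = batch()
init_loss = ((torch.relu(x0 @ W @ W.T + b) - x0) ** 2).mean().item()

for _ in range(2000):
    x = batch()
    recon = torch.relu(x @ W @ W.T + b)  # tied-weight autoencoder
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

final_loss = ((torch.relu(x0 @ W @ W.T + b) - x0) ** 2).mean().item()
assert final_loss < init_loss  # reconstruction improved despite n > d
```

After training, the rows of W typically arrange into near-antipodal or polygonal directions, the superposition geometry the original paper visualizes.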
Solutions to ML assignments from the Alignment Research Engineering Accelerator (ARENA) in-person program
Reverse-engineered Transformer models as a benchmark for interpretability methods
Organizer's repository for the Transformer Interpretability CodaBench competition
A project that simulates a game of shuffling cups with a hidden ball underneath one of them, and trains a Transformer-based deep learning model to predict the ball's final position after a series of swaps.