There are 28 repositories under the mechanistic-interpretability topic.
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Mechanistically interpretable neurosymbolic AI (Nature Computational Science, 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out of distribution and outperform human-designed algorithms
Interpreting how transformers simulate agents performing RL tasks
🧠 Starter templates for doing interpretability research
Sparse and discrete interpretability tool for neural networks
Full code for the sparse probing paper.
Explain a black-box module in natural language.
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
Steering vectors for transformer language models in PyTorch / Hugging Face
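As a hedged illustration of the technique this repo implements (not its actual API), a steering vector is an activation-space direction added to a layer's output at inference time. A minimal sketch with a plain PyTorch forward hook, using a toy linear layer in place of a real transformer block:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer standing in for a transformer block's output projection.
layer = nn.Linear(8, 8)

# Hypothetical steering vector; in practice it is derived from the model's
# own activations (e.g. the difference between two contrastive prompts).
steering_vector = torch.randn(8)

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output.
    return output + steering_vector

x = torch.zeros(1, 8)

handle = layer.register_forward_hook(add_steering)
steered = layer(x)
handle.remove()
unsteered = layer(x)

# The steered output differs from the unsteered one by exactly the vector.
assert torch.allclose(steered - unsteered, steering_vector)
```

In a real language model the hook would be registered on a chosen decoder layer, and the vector scaled to control steering strength.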
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Universal Neurons in GPT2 Language Models
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. Open-sourced and constantly updated.
🦠 DeepDecipher: An open source API to MLP neurons
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
Identifying Circuit behind Pronoun Prediction in GPT-2 Small
graphpatch is a library for activation patching on PyTorch neural network models.
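As a hedged sketch of activation patching in general (plain PyTorch hooks, not graphpatch's actual interface): cache an activation from a clean run, then overwrite the same activation during a corrupted run to measure how much that site restores the clean behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer model standing in for a transformer.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))

clean_input = torch.ones(1, 4)
corrupt_input = -torch.ones(1, 4)

# 1. Cache the first layer's activation on the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

h = model[0].register_forward_hook(save_hook)
clean_out = model(clean_input)
h.remove()

# 2. Run the corrupted input, patching in the cached clean activation.
def patch_hook(module, inputs, output):
    return cache["act"]  # replace the corrupted activation wholesale

h = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
h.remove()

# Everything downstream of the patch only sees the clean activation,
# so the patched run reproduces the clean output exactly.
assert torch.allclose(patched_out, clean_out)
```

Real patching experiments patch narrower slices (one head, one token position) and compare a behavioral metric across runs rather than raw outputs.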
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Interpretability on 1-layer Transformer models that converge on the Bayesian-optimal solution for statistical tasks
Starting Kit for the CodaBench competition on Transformer Interpretability
Exploring length generalization in the context of the indirect object identification (IOI) task for mechanistic interpretability.
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Visualising (self)-attention as a vector field: exploring and building intuition. Based on anvaka.github.io/fieldplay.
A replication of "Toy Models of Superposition," a groundbreaking machine learning research paper published by authors affiliated with Anthropic and Harvard in 2022.
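The setup replicated here can be sketched in a few lines (a minimal sketch under assumed hyperparameters, not the repo's code): sparse features are squeezed through a smaller hidden dimension and reconstructed with tied weights and a ReLU, and sparsity lets the model store more features than dimensions in superposition.

```python
import torch

torch.manual_seed(0)

# n sparse features, bottlenecked through a smaller hidden dimension.
n_features, d_hidden = 5, 2
W = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

def batch(size=256, p=0.05):
    # Each feature fires with probability p, magnitude uniform in [0, 1].
    x = torch.rand(size, n_features)
    return x * (torch.rand(size, n_features) < p)

x0 = batch()
init_loss = ((torch.relu(x0 @ W @ W.T + b) - x0) ** 2).mean().item()

for _ in range(2000):
    x = batch()
    recon = torch.relu(x @ W @ W.T + b)  # tied-weight autoencoder
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

final_loss = ((torch.relu(x0 @ W @ W.T + b) - x0) ** 2).mean().item()
assert final_loss < init_loss  # reconstruction improved despite n > d
```

After training, the rows of W typically arrange into near-antipodal or polygonal directions, the superposition geometry the original paper visualizes.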
Solutions to ML assignments from the Alignment Research Engineering Accelerator (ARENA) in-person program
Reverse-engineered Transformer models as a benchmark for interpretability methods
Organizer's repository for the Transformer Interpretability CodaBench competition
A project that simulates a game of shuffling cups with a hidden ball underneath one of them, and trains a Transformer-based deep learning model to predict the ball's final position after a series of swaps.