Wes Gurnee's starred repositories
awesome-neural-geometry
A curated collection of resources and research related to the geometry of representations in the brain, deep networks, and beyond
representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
Awesome-Interpretability-in-Large-Language-Models
This repository collects all relevant resources about interpretability in LLMs
world-models
Extracting spatial and temporal world models from LLMs
sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
Awesome-LLM-Interpretability
A curated list of LLM interpretability-related material: tutorials, libraries, surveys, papers, blogs, etc.
sleeper-agents-paper
Contains random samples referenced in the paper "Sleeper Agents: Training Robustly Deceptive LLMs that Persist Through Safety Training".
sparse-probing-paper
Full code for the sparse probing paper.
universal-neurons
Universal Neurons in GPT2 Language Models
elk-generalization
Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard ones
edge-attribution-patching
Code for my NeurIPS 2023 ATTRIB paper titled "Attribution Patching Outperforms Automated Circuit Discovery"