Aidan Ewart's repositories
easy-sae-training
Easy training for Sparse Linear Autoencoders (https://arxiv.org/abs/2309.08600) with data from TransformerLens models.
sparse_coding
Replicating and extending the sparse coding approach to taking transformer features out of superposition.
analysis_lean
A formalization of my analysis course, in Lean.
automated-interpretability-mistral
Getting OpenAI autointerp to work with locally-run (finetuned) Mistral-7B instances.
bijou
Another compiler for a functional programming language, this time hopefully using LLVM in the backend.
group_projects
Uni group project homework (probably mostly bad TeX files)
latent-adverserial-training
Experiments with latent adversarial training (LAT) using activation addition vectors.
lynn
An implementation of a linear type theory with uniqueness types (I think in a similar style to McBride's work; literature is hard)
mechanistic-unlearning
Machine Unlearning via pruning/circuit discovery
othello_world_ppo
Emergent world representations: Exploring a sequence model trained on a synthetic task
Polygraph
RLHF Mechanistic Interpretability and Deception
rlaif-jailbreaking
Self-improving PAIR using RLAIF and MCTS.
sae-alternatives
Evaluating alternatives to boring linear sparse autoencoders for latent disentanglement
scratch-transformer
oh no im doing ml
sdl-steering
A collection of experiments trying to evaluate how useful sparse dictionary learning (SDL) methods are for model steering (i.e. identifying 'important components of feature representations').
set-theory-prover
For an AQA A-Level computer science NEA project
soviet-language
Forth, but all functions are global to the entire internet
sparse_autoencoder
Sparse Autoencoder for Mechanistic Interpretability
switchcraft-stuff
Random stuff probably for switchcraft.