Baidicoot

UK

https://bayesteezian.net/

Organizations

piwars-rgs

Aidan Ewart's repositories

easy-sae-training

Easy training for Sparse Linear Autoencoders (https://arxiv.org/abs/2309.08600) with data from TransformerLens models.

Language: Python | 600

mini

typed successor to rpncalc

Language: Haskell | License: GPL-3.0 | 3 20

sae_alternatives

Language: Jupyter Notebook | 1 10

sparse_coding

Work on sparse coding, replicating and extending the sparse coding approach to taking transformer features out of superposition.

Language: Jupyter Notebook | 100

analysis_lean

A formalization of my analysis course, in Lean.

Language: Lean | 000

autocircuit

Language: Jupyter Notebook | 000

automated-interpretability

000

automated-interpretability-mistral

Getting OpenAI autointerp to work with locally-run (finetuned) Mistral-7B instances.

Language: Python | 000

bijou

Another compiler for a functional programming language, this time hopefully using LLVM in the backend.

Language: Haskell | 000

group_projects

uni group projects homework (probably mostly bad TeX files)

Language: TeX | 000

homework

Language: TeX | 010

latent-adverserial-training

Experiments with LAT using activation addition vectors.

License: Apache-2.0 | 000

lynn

An implementation of a linear type theory with uniqueness types (I think in a similar style to McBride's work; literature is hard).

Language: Haskell | 000

mech-interp-hackery

Language: Jupyter Notebook | 000

mechanistic-unlearning

Machine Unlearning via pruning/circuit discovery

License: Apache-2.0 | 000

othello_world_ppo

Emergent world representations: Exploring a sequence model trained on a synthetic task

License: MIT | 000

Polygraph

RLHF Mechanistic Interpretability and Deception

License: MIT | 000

resnet-deep-double

Language: Python | 000

rl-test

Language: Python | 000

rlaif-jailbreaking

Self-improving PAIR using RLAIF and MCTS.

Language: Python | 000

sae-alternatives

evaluating alternatives to boring linear sparse autoencoders for latent disentanglement

000

scratch-transformer

oh no im doing ml

Language: Python | 000

sdl-steering

A collection of experiments trying to evaluate how useful sparse dictionary learning (SDL) methods are for model steering (i.e. identifying 'important components of feature representations').

Language: Python | License: GPL-3.0 | 000

set-theory-prover

For an AQA A-Level computer science NEA project

Language: Haskell | 000

soviet-language

Forth, but all functions are global to the entire internet

Language: JavaScript | 030

sparse_autoencoder

Sparse Autoencoder for Mechanistic Interpretability

Language: Python | License: MIT | 000

switchcraft-stuff

Random stuff probably for switchcraft.

Language: Lua | 000

TransformerLens

Language: Python | License: MIT | 000

transformers_toy_compute_model

Language: Python | 000

weak-to-strong

Language: Python | License: MIT | 000