apartresearch's repositories
interpretability-starter
Starter templates for doing interpretability research
specificityplus
Code for the ACL paper "Detecting Edit Failures in LLMs: An Improved Specificity Benchmark"
Neuron2Graph
Tools for exploring Transformer neuron behaviour, including input pruning and diversification.
readingwhatwecan
Reading everything
Integer_Addition
Understanding the underlying learning dynamics of simple tasks in Transformer networks
aisafetyideas
The web app CI/CD for aisafetyideas.com
deepdecipher
DeepDecipher: An open source API to MLP neurons
evaluations-starter
How to get started in evaluations and demonstrations research for dangerous capabilities
ai-psychology-starter
Code templates to get started as an AI psychologist
mechanisticinterpretability
A repository for awesome resources in mechanistic interpretability
AIS-cost-effectiveness
Cost-effectiveness models, tools, and results for various AI safety field-building programs.
othelloscope
Interpretability Hackathon 2.0 entry
scheduling-widget
Showcases specific times in local time zones
blackbox-psych
Conducting psychology experiments on black box language models
empathetic-ai
A systematic review on how to create empathetic AI
ICML2024MI
Website for NeurIPS2023MI
safety-timelines
Research into when alignment is solved
scale-llm-24
Website for the Scaling Laws workshop
seqcont_circuits
Interpreting how similar sequence continuation tasks share internal representations
task-standard
METR Task Standard fork for the Code Red Hackathon
GPT-4-Chat-UI
GPT-4 frontend with open source Next.js template.
hackathon-utils
Code to run hackathons efficiently
Interpreting-Reward-Models
Interpreting implicit reward models learnt in RLHF using sparse autoencoders.
open
Repository to update our open data
paper-website
Website template for academic papers
town_hall_avatar
Uses ChatGPT to simulate a town hall discussion between avatars