Code for the paper "Fishing for Magikarp"

This repository contains the code and extended results for the paper "Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models".

Exploring Results

The most interesting thing in this repository is probably the detailed reports, found in results/reports.

  • In the reports, ▁ (but not _) is a space, and ¿entry? represents tokens with a vocabulary entry which was not encoded as expected.
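
The ¿entry? marker concerns tokens whose vocabulary entry is not encoded as expected. One plausible way to surface such tokens is a decode/encode round trip: decode a single token id to text, encode that text again, and check whether the same id comes back. The sketch below illustrates the idea only. ToyTokenizer and tokens_failing_roundtrip are hypothetical names, the greedy longest-match tokenizer is a stand-in, and none of this is taken from the repository's own code.

```python
class ToyTokenizer:
    """Hypothetical greedy longest-match tokenizer, for illustration only."""

    def __init__(self, vocab):
        self.vocab = list(vocab)
        # For duplicate strings the lowest id wins, so later duplicates
        # can never be produced by encode().
        self.lookup = {}
        for idx, piece in enumerate(self.vocab):
            self.lookup.setdefault(piece, idx)
        self.max_len = max(len(piece) for piece in self.vocab)

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            # Try the longest candidate piece first.
            for length in range(min(self.max_len, len(text) - pos), 0, -1):
                piece = text[pos:pos + length]
                if piece in self.lookup:
                    ids.append(self.lookup[piece])
                    pos += length
                    break
            else:
                raise ValueError(f"cannot encode text at position {pos}")
        return ids


def tokens_failing_roundtrip(tokenizer, vocab_size):
    """Return ids whose decoded text does not encode back to the same id."""
    return [
        token_id
        for token_id in range(vocab_size)
        if tokenizer.encode(tokenizer.decode([token_id])) != [token_id]
    ]


if __name__ == "__main__":
    # Id 3 duplicates the string of id 2, so it is never produced by encode().
    toy = ToyTokenizer(["a", "b", "ab", "ab"])
    print(tokens_failing_roundtrip(toy, 4))  # prints [3]
```

With a real tokenizer the same loop would call that tokenizer's own encode and decode methods, and special handling (for example of leading spaces) may be needed before the comparison is meaningful.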

Running on other models

Setup

This is a standard Poetry project:
poetry shell   # make/activate your virtual environment
poetry install # only the first time or on updates

Running

See run_verification.sh for example commands for running new models. The script itself is mainly a reference for reproducibility; running it in full is not recommended.

For models with tied embeddings, or for nicer visualizations and results, you will need to hard-code some unused token ids in magikarp/unused_tokens.py.

  • If a related model already exists, copying the token ids is likely to work just fine.
  • For non-tied embeddings you can typically just let verification finish, and update unused tokens after you get the results.
  • For the rare case of new model families with tied embeddings:
    • Take a guess, like [0], or use the tokenizer vocabulary to pick some.
    • Run the magikarp/fishing.py script and kill it when it starts verifying.
    • You now have results/verifications/yourmodel.jsonl which allows you to look at the vocabulary and update suitable tokens.
    • Update your unused tokens, and run verification.
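
To look through results/verifications/yourmodel.jsonl before choosing unused tokens, a small reader like the following may help. It is a minimal sketch that assumes only that the file holds one JSON object per line; the field names are not assumed, so the script reports whichever keys it finds. load_jsonl and field_counts are hypothetical helpers, not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path


def load_jsonl(path):
    """Read one JSON value per non-empty line of the file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def field_counts(records):
    """Count how often each top-level key appears (assumes JSON objects)."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Placeholder file name, as used in the text above.
    path = Path("results/verifications/yourmodel.jsonl")
    if path.exists():
        records = load_jsonl(path)
        print(f"{len(records)} records")
        print(field_counts(records).most_common())
        for record in records[:5]:
            print(record)
    else:
        print(f"{path} not found; run magikarp/fishing.py first")
```

Once the available fields are known, the records can be filtered or sorted to find token ids that look like suitable unused tokens.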

Generating results

  • generate_results.py: Generates plots and markdown reports. After verification finishes, you should typically run python generate_results.py [your_model_id] and then look in results.

Contributing

If you want to contribute results for additional models, please include:

  • The UNUSED_TOKENS entry
    • Ensure the tokenization tests (run via pytest) pass for the new model; the tests use this array as a model registry.
  • A line in run_verification.sh
  • All files in results that are not .gitignore'd

License

Apache License 2.0

