asayeed / ActiveBaby

BabyLM challenge code

ActiveBaby

Repository for the BabyLM challenge.

Step-by-step process

BASELINE

  1. RoBERTa over entire corpus.

  2. Longformer over entire corpus.

  3. Evaluation metrics.

FANCIER

  1. RoBERTa over entire corpus.

  2. Trigram (?) surprisals over entire corpus.

     2.1 Construct/rescale surprisals into surprisal vectors (SurprisalSpace)

         • via some external HMM tool (like hmmlearn)

  3. Split into Initial and Pool.

  4. Longformer over Initial.

  5. Per-sentence perplexity over Initial.

  6. Top-k perplexity sentences become Centroids.

  7. kNN of Centroids in SurprisalSpace of Pool -> add to Initial.

  8. Go to step 4 (Longformer over Initial).

  9. Evaluation metrics.
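
One round of the FANCIER selection loop (per-sentence perplexity, top-k Centroids, then kNN in SurprisalSpace over Pool) could look roughly like the sketch below. It is not the code from the notebooks in this repo. It assumes that "top-k" means the k highest-perplexity sentences, that surprisal vectors are already computed and stored per sentence id, that distance is Euclidean, and that perplexity is the usual exp of the mean negative token log-probability. All function and parameter names (`sentence_perplexity`, `top_k_centroids`, `knn_from_pool`, `selection_round`, `n_neighbors`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]


def sentence_perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity of one sentence from natural-log token probabilities.

    Assumption: the standard definition, exp(-mean(log p)).
    """
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two surprisal vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_centroids(perplexity: Dict[int, float], k: int) -> List[int]:
    """Ids of the k highest-perplexity sentences (assumed meaning of top-k)."""
    return sorted(perplexity, key=perplexity.get, reverse=True)[:k]


def knn_from_pool(
    centroids: Sequence[int],
    pool: Set[int],
    vectors: Dict[int, Vector],
    n_neighbors: int,
) -> Set[int]:
    """For each Centroid, pick its n_neighbors nearest Pool sentences."""
    selected: Set[int] = set()
    for c in centroids:
        ranked = sorted(pool, key=lambda s: euclidean(vectors[c], vectors[s]))
        selected.update(ranked[:n_neighbors])
    return selected


def selection_round(
    initial: Set[int],
    pool: Set[int],
    vectors: Dict[int, Vector],
    perplexity: Dict[int, float],
    k: int,
    n_neighbors: int,
) -> Tuple[Set[int], Set[int]]:
    """Move the Pool neighbours of the top-k Centroids into Initial."""
    centroids = top_k_centroids({s: perplexity[s] for s in initial}, k)
    picked = knn_from_pool(centroids, pool, vectors, n_neighbors)
    return initial | picked, pool - picked
```

In a full run, the model would be retrained on the enlarged Initial after each round and the perplexities recomputed before the next call to `selection_round`.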

Monitoring the training process

https://wandb.ai/tony-xudong-hong/huggingface/runs/v9l8f6mg/overview?workspace=user-tony-xudong-hong

Timeline

January 2023: Training data released

March 2023: Shared evaluation pipeline released

July 15, 2023: Models and results due

August 1, 2023: Paper submissions due

Date TBA: Shared task presented at CoNLL

About

License: Apache License 2.0


Languages

Jupyter Notebook 89.7%, Python 10.3%