asayeed / ActiveBaby

BabyLM challenge code

ActiveBaby

Repo for BabyLM challenge

Step-by-step process

BASELINE

RoBERTa over entire corpus.
Longformer over entire corpus.
Evaluation metrics.

FANCIER

RoBERTa over entire corpus.
Trigram (?) surprisals over entire corpus.

2.1 Construct/rescale surprisals into surprisal vectors (SurprisalSpace)

- via some external HMM tool (like hmmlearn)

Split into Initial and Pool
Longformer over Initial
Per-sentence perplexity over Initial.
Take the top-k highest-perplexity sentences in Initial as Centroids.
For each Centroid, find its kNN in the SurprisalSpace of Pool -> add them to Initial
Goto 2.
Evaluation metrics.
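
One selection round of the FANCIER loop can be sketched roughly as follows. This is only an illustration under several assumptions, not the code in this repo: the function names (`trigram_surprisals`, `to_vector`, `select_from_pool`) are hypothetical, the trigram model uses simple add-alpha smoothing, the rescaling into SurprisalSpace is done by plain linear interpolation as a stand-in for the HMM-based step (hmmlearn is not used here), and the per-sentence perplexities are passed in as numbers instead of being computed by Longformer.

```python
import math
from collections import Counter


def trigram_surprisals(sentences, alpha=1.0):
    """Per-token surprisal (in bits) under an add-alpha smoothed trigram model.

    The model is estimated on `sentences` itself (whitespace tokenisation,
    an assumption made for brevity).
    """
    padded = [["<s>", "<s>"] + s.split() + ["</s>"] for s in sentences]
    tri, bi, vocab = Counter(), Counter(), set()
    for toks in padded:
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    v = len(vocab)
    result = []
    for toks in padded:
        row = []
        for i in range(2, len(toks)):
            num = tri[tuple(toks[i - 2:i + 1])] + alpha
            den = bi[tuple(toks[i - 2:i])] + alpha * v
            row.append(-math.log2(num / den))
        result.append(row)
    return result


def to_vector(surprisals, dim=4):
    """Rescale a variable-length surprisal sequence to `dim` values (dim >= 2).

    Linear interpolation is an assumed stand-in for the HMM-based rescaling.
    """
    n = len(surprisals)
    if n == 1:
        return [surprisals[0]] * dim
    vec = []
    for j in range(dim):
        pos = j * (n - 1) / (dim - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        vec.append(surprisals[lo] * (1 - frac) + surprisals[hi] * frac)
    return vec


def select_from_pool(initial_vecs, initial_ppl, pool_vecs, top_k=2, n_neighbors=1):
    """Indices of Pool sentences closest to the top_k highest-perplexity
    Initial sentences (the Centroids), by Euclidean distance."""
    centroids = sorted(range(len(initial_ppl)),
                       key=lambda i: initial_ppl[i], reverse=True)[:top_k]
    chosen = set()
    for c in centroids:
        ranked = sorted(range(len(pool_vecs)),
                        key=lambda j: math.dist(initial_vecs[c], pool_vecs[j]))
        chosen.update(ranked[:n_neighbors])
    return sorted(chosen)


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "a bird flew away", "the cat flew"]
    vectors = [to_vector(s) for s in trigram_surprisals(corpus)]
    initial, pool = vectors[:2], vectors[2:]
    fake_ppl = [12.0, 30.0]  # placeholder perplexities, not model output
    print(select_from_pool(initial, fake_ppl, pool, top_k=1, n_neighbors=1))
```

The indices returned by `select_from_pool` refer to positions in Pool; moving those sentences into Initial and retraining would close one iteration of the loop. For a real corpus, a vectorised kNN (for example from scikit-learn) would likely be preferable to the sorted scan used here.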

Monitoring the training process

https://wandb.ai/tony-xudong-hong/huggingface/runs/v9l8f6mg/overview?workspace=user-tony-xudong-hong

Timeline

~~January 2023: Training data released~~

~~March 2023: Shared evaluation pipeline released~~

July 15, 2023: Models and results due

August 1, 2023: Paper submissions due

Date TBA: Shared task presented at CoNLL

License

Apache License 2.0
