Repo for BabyLM challenge

1. RoBERTa over entire corpus.
2. Longformer over entire corpus.
3. Evaluation metrics.

1. RoBERTa over entire corpus.
2. Trigram (?) surprisals over entire corpus.
   2.1 Construct/rescale surprisals into surprisal vectors (SurprisalSpace)
     - via some external HMM tool (like hmmlearn)
3. Split into Initial and Pool.
4. Longformer over Initial.
5. Per-sentence perplexity over Initial.
6. Top-k perplexity sentences (Centroids).
7. kNN in SurprisalSpace of Pool -> add to Initial.
8. Goto 2.
9. Evaluation metrics.
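
A rough sketch of one round of the selection loop above (surprisal vectors, Centroids, kNN over Pool), to make the data flow concrete. Everything here is an assumption for illustration: the function names are hypothetical, the trigram model uses add-one smoothing, linear interpolation stands in for the HMM-based rescaling mentioned in 2.1, "top-k perplexity" is read as the k highest-perplexity sentences, and the perplexity values are placeholders for what the Longformer trained on Initial would produce.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their bigram contexts over tokenised sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for toks in sentences:
        padded = ["<s>", "<s>"] + toks + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)


def surprisals(toks, model):
    """Per-token surprisal in bits, with add-one smoothing (an assumption)."""
    tri, bi, vocab_size = model
    padded = ["<s>", "<s>"] + toks + ["</s>"]
    out = []
    for i in range(2, len(padded)):
        count = tri[tuple(padded[i - 2:i + 1])]
        context = bi[tuple(padded[i - 2:i])]
        out.append(-math.log2((count + 1) / (context + vocab_size)))
    return out


def rescale(values, dim=4):
    """Resample a surprisal sequence to `dim` (>= 2) points by linear interpolation.

    Stand-in for the HMM-based rescaling; not the method named in the notes.
    """
    if len(values) == 1:
        return [values[0]] * dim
    out = []
    for j in range(dim):
        pos = j * (len(values) - 1) / (dim - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def select_from_pool(initial_vecs, perplexities, pool_vecs, top_k=2, n_neighbors=1):
    """Pool indices nearest (Euclidean) to the top-k highest-perplexity Initial vectors."""
    order = sorted(range(len(initial_vecs)), key=lambda i: perplexities[i], reverse=True)
    centroids = [initial_vecs[i] for i in order[:top_k]]
    chosen = set()
    for c in centroids:
        ranked = sorted(range(len(pool_vecs)), key=lambda j: math.dist(c, pool_vecs[j]))
        chosen.update(ranked[:n_neighbors])
    return sorted(chosen)


# Toy run: the first two sentences play Initial, the rest play Pool.
corpus = [s.split() for s in ["the cat sat", "the dog sat", "a bird flew away", "the cat ran"]]
model = train_trigram(corpus)
vectors = [rescale(surprisals(s, model)) for s in corpus]
initial, pool = vectors[:2], vectors[2:]
# Placeholder perplexities; in the plan these come from Longformer over Initial.
picked = select_from_pool(initial, [12.0, 30.0], pool, top_k=1, n_neighbors=1)
```

The indices in `picked` refer to Pool; the corresponding sentences would be moved into Initial before the next round.
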
https://wandb.ai/tony-xudong-hong/huggingface/runs/v9l8f6mg/overview?workspace=user-tony-xudong-hong
January 2023: Training data released
March 2023: Shared evaluation pipeline released
July 15, 2023: Models and results due
August 1, 2023: Paper submissions due
Date TBA: Shared task presented at CoNLL