WikiText 103 evaluation
karpathy opened this issue
I've seen some repos use WikiText-103 as the evaluation dataset for GPT-like models, e.g.:
https://github.com/tysam-code/hlb-gpt/tree/main
Add a prepro script that downloads, preprocesses, and tokenizes WikiText-103, just like the tiny shakespeare / tiny stories scripts in this repo. Adapt the mainline training script train_gpt2.cu to report validation performance on this set.
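The prepro step boils down to: fetch the raw text, run it through the GPT-2 BPE tokenizer, and dump the token ids to a flat binary file the C code can read. A minimal sketch of the packing/reading half, using only the stdlib (the header layout here is an assumption mirroring the existing tiny shakespeare scripts, and the actual tokenization via tiktoken is elided):

```python
import struct
from array import array

MAGIC = 20240520   # assumed magic number, mirroring llm.c's data shard headers
VERSION = 1

def write_tokens(path, tokens):
    """Pack token ids as uint16 after a 256*int32 header: (magic, version, count, 0...)."""
    header = [MAGIC, VERSION, len(tokens)] + [0] * 253
    with open(path, "wb") as f:
        f.write(struct.pack("<256i", *header))
        array("H", tokens).tofile(f)  # uint16 is enough: GPT-2 vocab < 65536

def read_tokens(path):
    """Inverse of write_tokens; validates the header and returns the token ids."""
    with open(path, "rb") as f:
        header = struct.unpack("<256i", f.read(256 * 4))
        assert header[0] == MAGIC and header[1] == VERSION
        toks = array("H")
        toks.fromfile(f, header[2])
    return list(toks)

# round-trip a handful of placeholder "WikiText-103 validation" token ids
write_tokens("wiki_val.bin", [50256, 1, 2, 3])
print(read_tokens("wiki_val.bin"))  # -> [50256, 1, 2, 3]
```

The fixed 256-int header means the C side can read the metadata with a single fread before mmap-ing or streaming the uint16 payload.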
Add Python code that does the same: evaluate on WikiText-103 and report performance for all the GPT-2 model sizes. This is our baseline to reach when training from a scratch init.
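Reporting validation performance here usually means averaging the per-token cross-entropy over the whole split and exponentiating to get perplexity (the GPT-2 paper reports zero-shot WikiText-103 perplexities for all four model sizes, though its word-level normalization differs from raw token-level PPL). A tiny sketch of that final step, independent of any model library:

```python
import math

def perplexity(nlls):
    """Token-level perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Sanity ceiling: a model that is uniform over GPT-2's 50257-token vocab
# has per-token NLL of ln(50257), i.e. perplexity exactly 50257.
uniform = math.log(50257)
print(round(perplexity([uniform] * 4)))  # -> 50257
```

In practice the per-token NLLs would come from sliding a 1024-token context window over the tokenized validation set (with some stride so every token is predicted with reasonable context), which is the usual way long corpora are scored with fixed-context models.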
Optionally, help research other ways people have evaluated GPT-2 models, or attempted to reproduce them, in the past.
We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy / Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.