WikiText 103 evaluation
karpathy opened this issue
I've seen some repos use WikiText-103 as the evaluation dataset for GPT-like models, e.g.:
https://github.com/tysam-code/hlb-gpt/tree/main
Add a prepro script that downloads, preprocesses, and tokenizes WikiText-103, just like the tiny shakespeare / tiny stories scripts in this repo. Adapt the mainline training script train_gpt2.cu to report validation performance on this set.
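The prepro step boils down to: fetch the raw text, run it through the GPT-2 BPE tokenizer, and dump the token ids to a flat binary file the C code can read. A minimal sketch of the packing/reading half, using only the stdlib (the header layout here is an assumption mirroring the existing tiny shakespeare scripts, and the actual tokenization via tiktoken is elided):

```python
import struct
from array import array

MAGIC = 20240520   # assumed magic number, mirroring llm.c's data shard headers
VERSION = 1

def write_tokens(path, tokens):
    """Pack token ids as uint16 after a 256*int32 header: (magic, version, count, 0...)."""
    header = [MAGIC, VERSION, len(tokens)] + [0] * 253
    with open(path, "wb") as f:
        f.write(struct.pack("<256i", *header))
        array("H", tokens).tofile(f)  # uint16 is enough: GPT-2 vocab < 65536

def read_tokens(path):
    """Inverse of write_tokens; validates the header and returns the token ids."""
    with open(path, "rb") as f:
        header = struct.unpack("<256i", f.read(256 * 4))
        assert header[0] == MAGIC and header[1] == VERSION
        toks = array("H")
        toks.fromfile(f, header[2])
    return list(toks)

# round-trip a handful of placeholder "WikiText-103 validation" token ids
write_tokens("wiki_val.bin", [50256, 1, 2, 3])
print(read_tokens("wiki_val.bin"))  # -> [50256, 1, 2, 3]
```

The fixed 256-int header means the C side can read the metadata with a single fread before mmap-ing or streaming the uint16 payload.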
Add Python code that does the same: evaluate on WikiText-103 and report performance for all the GPT-2 model sizes. This is our baseline to reach when training from a scratch init.
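Reporting validation performance here usually means averaging the per-token cross-entropy over the whole split and exponentiating to get perplexity (the GPT-2 paper reports zero-shot WikiText-103 perplexities for all four model sizes, though its word-level normalization differs from raw token-level PPL). A tiny sketch of that final step, independent of any model library:

```python
import math

def perplexity(nlls):
    """Token-level perplexity from per-token negative log-likelihoods (in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Sanity ceiling: a model that is uniform over GPT-2's 50257-token vocab
# has per-token NLL of ln(50257), i.e. perplexity exactly 50257.
uniform = math.log(50257)
print(round(perplexity([uniform] * 4)))  # -> 50257
```

In practice the per-token NLLs would come from sliding a 1024-token context window over the tokenized validation set (with some stride so every token is predicted with reasonable context), which is the usual way long corpora are scored with fixed-context models.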
Optionally, help research other ways people have evaluated GPT-2 models, or attempted to reproduce them, in the past.
We are abandoning WikiText-103 because it's a total mess. We'll instead look at one or a few of ARC Easy / Challenge, SQuAD, HellaSwag, TriviaQA, LAMBADA. Closing.