Evaluation benchmarks (lm-eval-harness)
justheuristic opened this issue
Thanks for the awesome work! (and especially for choosing to make it freely available)
If you have time, please also consider running the evaluation benchmarks from lm-eval-harness
https://github.com/EleutherAI/lm-evaluation-harness
[Despite it having a ton of different benchmarks, you only need to implement one model interface, and the harness runs all the benchmarks for you.]
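For reference, a rough sketch of what that single interface might look like. This assumes the older `lm_eval.base.LM` base class (method names and signatures can differ between harness versions), and the `_score_pair` / `_generate_until` helpers are hypothetical placeholders for your model's own scoring and decoding code:

```python
from lm_eval.base import LM  # base class location may differ in newer harness versions


class MyModelAdapter(LM):
    """Hypothetical adapter that wraps your model for lm-eval-harness."""

    def __init__(self, model, tokenizer):
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer

    def loglikelihood(self, requests):
        # For each (context, continuation) pair, return
        # (log P(continuation | context), whether the continuation is the greedy decode).
        return [self._score_pair(ctx, cont) for ctx, cont in requests]

    def loglikelihood_rolling(self, requests):
        # Full-string log-likelihoods, used by perplexity-style tasks.
        return [self._score_pair("", text)[0] for (text,) in requests]

    def greedy_until(self, requests):
        # Greedy generation up to the given stop sequences, used by generation tasks.
        return [self._generate_until(ctx, stop) for ctx, stop in requests]

    # The two helpers below are hypothetical placeholders for your model's own
    # scoring and decoding code.
    def _score_pair(self, context, continuation):
        ...

    def _generate_until(self, context, stop_sequences):
        ...
```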
It is a more-or-less standard tool for benchmarking how well your model performs on a range of tasks (generation, common sense, math, etc.).
There is a huge number of tasks, so if you want to pick an initial set, consider taking the ones that GPT-J reports here: https://huggingface.co/EleutherAI/gpt-j-6B#evaluation-results
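As a concrete starting point, something along these lines should run a GPT-J-like task subset end to end. This is only a sketch: the `hf-causal` adapter name, the exact task identifiers (e.g. `lambada_openai`), and the `simple_evaluate` keyword arguments depend on which harness version you install, so please double-check them against the repo:

```python
from lm_eval import evaluator

# Rough sketch: model/task names and keyword arguments may differ by harness version.
results = evaluator.simple_evaluate(
    model="hf-causal",                            # built-in HuggingFace causal-LM adapter
    model_args="pretrained=EleutherAI/gpt-j-6B",  # swap in your own checkpoint here
    tasks=["lambada_openai", "piqa", "hellaswag", "winogrande"],  # roughly the GPT-J report set
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics (accuracy, perplexity, etc.)
```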