Evaluation benchmarks (lm-eval-harness)
justheuristic opened this issue
Thanks for the awesome work! (and especially for choosing to make it freely available)
If you have time, please also consider running the evaluation benchmarks from lm-eval-harness
https://github.com/EleutherAI/lm-evaluation-harness
[Despite it having a ton of different benchmarks, you only need to implement one model interface, and the harness runs all the benchmarks for you.]
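For reference, a rough sketch of what that single interface might look like. This assumes the older `lm_eval.base.LM` base class (method names and signatures can differ between harness versions), and the `_score_pair` / `_generate_until` helpers are hypothetical placeholders for your model's own scoring and decoding code:

```python
from lm_eval.base import LM  # base class location may differ in newer harness versions


class MyModelAdapter(LM):
    """Hypothetical adapter that wraps your model for lm-eval-harness."""

    def __init__(self, model, tokenizer):
        super().__init__()
        self.model = model
        self.tokenizer = tokenizer

    def loglikelihood(self, requests):
        # For each (context, continuation) pair, return
        # (log P(continuation | context), whether the continuation is the greedy decode).
        return [self._score_pair(ctx, cont) for ctx, cont in requests]

    def loglikelihood_rolling(self, requests):
        # Full-string log-likelihoods, used by perplexity-style tasks.
        return [self._score_pair("", text)[0] for (text,) in requests]

    def greedy_until(self, requests):
        # Greedy generation up to the given stop sequences, used by generation tasks.
        return [self._generate_until(ctx, stop) for ctx, stop in requests]

    # The two helpers below are hypothetical placeholders for your model's own
    # scoring and decoding code.
    def _score_pair(self, context, continuation):
        ...

    def _generate_until(self, context, stop_sequences):
        ...
```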
It is a more-or-less standard tool for benchmarking how well your model performs on a range of tasks (generation, common sense, math, etc.).
There is a huge number of tasks, so if you want to pick an initial set, consider taking the ones that GPT-J reports here: https://huggingface.co/EleutherAI/gpt-j-6B#evaluation-results
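As a concrete starting point, something along these lines should run a GPT-J-like task subset end to end. This is only a sketch: the `hf-causal` adapter name, the exact task identifiers (e.g. `lambada_openai`), and the `simple_evaluate` keyword arguments depend on which harness version you install, so please double-check them against the repo:

```python
from lm_eval import evaluator

# Rough sketch: model/task names and keyword arguments may differ by harness version.
results = evaluator.simple_evaluate(
    model="hf-causal",                            # built-in HuggingFace causal-LM adapter
    model_args="pretrained=EleutherAI/gpt-j-6B",  # swap in your own checkpoint here
    tasks=["lambada_openai", "piqa", "hellaswag", "winogrande"],  # roughly the GPT-J report set
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics (accuracy, perplexity, etc.)
```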