EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Sorting task output alphabetically

ad8e opened this issue

When we run lm-eval, our output looks like this, with the tasks appearing in a different order on every run:

hf (pretrained=meta-llama/Meta-Llama-3-8B), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  |Value |   |Stderr|
|--------------|------:|------|-----:|----------|-----:|---|-----:|
|piqa          |      1|none  |     0|acc       |0.7960|±  |0.0094|
|              |       |none  |     0|acc_norm  |0.8069|±  |0.0092|
|arc_easy      |      1|none  |     0|acc       |0.8013|±  |0.0082|
|              |       |none  |     0|acc_norm  |0.7778|±  |0.0085|
|arc_challenge |      1|none  |     0|acc       |0.5026|±  |0.0146|
|              |       |none  |     0|acc_norm  |0.5341|±  |0.0146|
|winogrande    |      1|none  |     0|acc       |0.7309|±  |0.0125|
|sciq          |      1|none  |     0|acc       |0.9630|±  |0.0060|
|              |       |none  |     0|acc_norm  |0.9390|±  |0.0076|
|boolq         |      2|none  |     0|acc       |0.8113|±  |0.0068|
|hellaswag     |      1|none  |     0|acc       |0.6018|±  |0.0049|
|              |       |none  |     0|acc_norm  |0.7910|±  |0.0041|
|lambada_openai|      1|none  |     0|perplexity|3.0953|±  |0.0571|
|              |       |none  |     0|acc       |0.7557|±  |0.0060|
|openbookqa    |      1|none  |     0|acc       |0.3500|±  |0.0214|
|              |       |none  |     0|acc_norm  |0.4500|±  |0.0223|

This makes it difficult to compare different models, since the tasks need to appear in the same order to line results up side by side. I understand that alphabetical ordering could bias attention toward alphabetically-earlier tasks, but the tradeoff would be worth it, at least as an option, since we have to sort the tasks manually anyway. (A workaround sketch is shown below.)
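Until the harness supports this directly, one workaround is to post-process the results JSON that lm-eval writes out (via `--output_path`) and reprint the metrics with the tasks in alphabetical order. This is only a minimal sketch, not the harness's own table formatter: it assumes the saved file has a top-level `"results"` dict keyed by task name, and the `results.json` filename is a placeholder for whatever path your run produced.

```python
import json

# Hypothetical path: point this at the JSON file produced by --output_path.
RESULTS_PATH = "results.json"

with open(RESULTS_PATH) as f:
    data = json.load(f)

# Assumption: the file contains a top-level "results" dict mapping
# task names to metric -> value pairs (exact metric key names vary by version).
results = data.get("results", {})

for task in sorted(results):  # alphabetical task order
    for metric, value in sorted(results[task].items()):
        # Skip non-numeric entries such as aliases or version strings.
        if isinstance(value, (int, float)):
            print(f"{task:<20} {metric:<30} {value:.4f}")
```

If you call the harness from Python instead of the CLI, the same idea applies: sort the `"results"` dictionary by key (e.g. `dict(sorted(results["results"].items()))`) before formatting it into a table.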

Addressing in #1791!

Hooray, thanks!