EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Sorting task output alphabetically

ad8e opened this issue

When we run lm-eval, our output looks like this, with the tasks appearing in a different order on every run:

hf (pretrained=meta-llama/Meta-Llama-3-8B), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
|    Tasks     |Version|Filter|n-shot|  Metric  |Value |   |Stderr|
|--------------|------:|------|-----:|----------|-----:|---|-----:|
|piqa          |      1|none  |     0|acc       |0.7960|±  |0.0094|
|              |       |none  |     0|acc_norm  |0.8069|±  |0.0092|
|arc_easy      |      1|none  |     0|acc       |0.8013|±  |0.0082|
|              |       |none  |     0|acc_norm  |0.7778|±  |0.0085|
|arc_challenge |      1|none  |     0|acc       |0.5026|±  |0.0146|
|              |       |none  |     0|acc_norm  |0.5341|±  |0.0146|
|winogrande    |      1|none  |     0|acc       |0.7309|±  |0.0125|
|sciq          |      1|none  |     0|acc       |0.9630|±  |0.0060|
|              |       |none  |     0|acc_norm  |0.9390|±  |0.0076|
|boolq         |      2|none  |     0|acc       |0.8113|±  |0.0068|
|hellaswag     |      1|none  |     0|acc       |0.6018|±  |0.0049|
|              |       |none  |     0|acc_norm  |0.7910|±  |0.0041|
|lambada_openai|      1|none  |     0|perplexity|3.0953|±  |0.0571|
|              |       |none  |     0|acc       |0.7557|±  |0.0060|
|openbookqa    |      1|none  |     0|acc       |0.3500|±  |0.0214|
|              |       |none  |     0|acc_norm  |0.4500|±  |0.0223|

This makes it difficult to compare different models, since the tasks need to appear in the same order to line results up side by side. I understand that alphabetical ordering could bias attention toward alphabetically-earlier tasks, but the tradeoff would be worth it, at least as an option, since we have to sort the tasks manually anyway. (A workaround sketch is shown below.)
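Until the harness supports this directly, one workaround is to post-process the results JSON that lm-eval writes out (via `--output_path`) and reprint the metrics with the tasks in alphabetical order. This is only a minimal sketch, not the harness's own table formatter: it assumes the saved file has a top-level `"results"` dict keyed by task name, and the `results.json` filename is a placeholder for whatever path your run produced.

```python
import json

# Hypothetical path: point this at the JSON file produced by --output_path.
RESULTS_PATH = "results.json"

with open(RESULTS_PATH) as f:
    data = json.load(f)

# Assumption: the file contains a top-level "results" dict mapping
# task names to metric -> value pairs (exact metric key names vary by version).
results = data.get("results", {})

for task in sorted(results):  # alphabetical task order
    for metric, value in sorted(results[task].items()):
        # Skip non-numeric entries such as aliases or version strings.
        if isinstance(value, (int, float)):
            print(f"{task:<20} {metric:<30} {value:.4f}")
```

If you call the harness from Python instead of the CLI, the same idea applies: sort the `"results"` dictionary by key (e.g. `dict(sorted(results["results"].items()))`) before formatting it into a table.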

Addressing in #1791!

Hooray, thanks!