Sorting task output alphabetically
ad8e opened this issue
Kevin Yin commented
When we run lm-eval, our output looks like this, with the tasks in a different, scrambled order on every run:
hf (pretrained=meta-llama/Meta-Llama-3-8B), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|--------------|------:|------|-----:|----------|-----:|---|-----:|
|piqa | 1|none | 0|acc |0.7960|± |0.0094|
| | |none | 0|acc_norm |0.8069|± |0.0092|
|arc_easy | 1|none | 0|acc |0.8013|± |0.0082|
| | |none | 0|acc_norm |0.7778|± |0.0085|
|arc_challenge | 1|none | 0|acc |0.5026|± |0.0146|
| | |none | 0|acc_norm |0.5341|± |0.0146|
|winogrande | 1|none | 0|acc |0.7309|± |0.0125|
|sciq | 1|none | 0|acc |0.9630|± |0.0060|
| | |none | 0|acc_norm |0.9390|± |0.0076|
|boolq | 2|none | 0|acc |0.8113|± |0.0068|
|hellaswag | 1|none | 0|acc |0.6018|± |0.0049|
| | |none | 0|acc_norm |0.7910|± |0.0041|
|lambada_openai| 1|none | 0|perplexity|3.0953|± |0.0571|
| | |none | 0|acc |0.7557|± |0.0060|
|openbookqa | 1|none | 0|acc |0.3500|± |0.0214|
| | |none | 0|acc_norm |0.4500|± |0.0223|
This makes it difficult to compare different models, since the tasks need to appear in the same order. I understand that alphabetical sorting creates a bias toward alphabetically-earlier tasks, but the tradeoff would be worth it as an opt-in option, since we have to put the rows in some order anyway.
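For illustration, the requested behavior amounts to ordering the per-task result rows by task name before rendering the table. A minimal sketch (the `results` dict below is illustrative, not lm-eval's actual internal structure):

```python
# Hypothetical sketch: sort per-task results alphabetically by task name
# before printing, so the table order is stable across runs and models.

results = {
    "piqa": {"acc": 0.7960},
    "arc_easy": {"acc": 0.8013},
    "winogrande": {"acc": 0.7309},
    "boolq": {"acc": 0.8113},
}

# dicts preserve insertion order in Python 3.7+, so rebuilding the dict
# from sorted items fixes the iteration (and thus display) order.
sorted_results = dict(sorted(results.items()))

for task, metrics in sorted_results.items():
    print(task, metrics)
```

With a stable order like this, tables from different models line up row-for-row.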
Hailey Schoelkopf commented
Addressing in #1791 !
Kevin Yin commented
Hooray, thanks!