EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Bad results for LLaMA

juletx opened this issue

I have evaluated LLaMA (7B, 13B and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the results yet; I'm posting them here so that we can fix the problems that we find. I can share more results and configs if you need more information.

This is the script that I have used for evaluation.

# model_names, tasks_selected, tasks (an associative array mapping group name -> task list)
# and num_fewshot are defined earlier in the job script.
for model_name in "${model_names[@]}"; do
    for group_name in "${tasks_selected[@]}"; do
        srun python3 lm-evaluation-harness/main.py \
            --model hf-causal-experimental \
            --model_args pretrained=$model_name,use_accelerate=True \
            --tasks ${tasks[${group_name}]} \
            --device cuda \
            --output_path results/llama-${model_name:48}_${group_name}_${num_fewshot}-shot.json \
            --batch_size auto \
            --no_cache \
            --num_fewshot ${num_fewshot}
    done
done
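For anyone who prefers to drive this from Python instead of the CLI, a rough equivalent of the call above using the harness's evaluator API is sketched below. The checkpoint id and task list are placeholders, and argument names may vary slightly between harness versions.

# Sketch of a programmatic equivalent of the CLI call above; assumes lm-evaluation-harness
# is installed and that the checkpoint id (used only as an example) is available.
import json
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=huggyllama/llama-7b,use_accelerate=True",  # example checkpoint id
    tasks=["piqa", "arc_easy", "lambada_openai"],  # any task names known to the harness
    num_fewshot=0,
    batch_size="auto",
    device="cuda",
    no_cache=True,
)
print(evaluator.make_table(results))
with open("llama-7b_example_0-shot.json", "w") as f:
    json.dump(results, f, indent=2)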

Common Sense Reasoning

Results are similar to the paper, generally a bit lower. This is expected because of differences in the prompts. Exceptions include ARC and openbookqa, where the results are much lower.

Task Version Metric Value Stderr
piqa 0 acc 0.7818 ± 0.0096
acc_norm 0.7742 ± 0.0098
wsc273 0 acc 0.8095 ± 0.0238
arc_easy 0 acc 0.6738 ± 0.0096
acc_norm 0.5248 ± 0.0102
hellaswag 0 acc 0.5639 ± 0.0049
acc_norm 0.7298 ± 0.0044
winogrande 0 acc 0.6693 ± 0.0132
prost 0 acc 0.2569 ± 0.0032
acc_norm 0.2803 ± 0.0033
swag 0 acc 0.5547 ± 0.0035
acc_norm 0.6687 ± 0.0033
boolq 1 acc 0.7306 ± 0.0078
arc_challenge 0 acc 0.3823 ± 0.0142
acc_norm 0.4138 ± 0.0144
mc_taco 0 em 0.1126
f1 0.4827
copa 0 acc 0.8400 ± 0.0368
openbookqa 0 acc 0.2820 ± 0.0201
acc_norm 0.4240 ± 0.0221

Mathematical Reasoning

Very low accuracies are obtained, 0 in some cases. GSM8K and MATH results are much lower than in the paper.

Task Version Metric Value Stderr
mathqa 0 acc 0.2677 ± 0.0081
acc_norm 0.2787 ± 0.0082
math_asdiv 0 acc 0.0000 ± 0.0000
gsm8k 0 acc 0.0000 ± 0.0000
math_num_theory 1 acc 0.0074 ± 0.0037
math_precalc 1 acc 0.0037 ± 0.0026
drop 1 em 0.0427 ± 0.0021
f1 0.1216 ± 0.0025
math_geometry 1 acc 0.0084 ± 0.0042
math_counting_and_prob 1 acc 0.0169 ± 0.0059
math_intermediate_algebra 1 acc 0.0066 ± 0.0027
math_prealgebra 1 acc 0.0126 ± 0.0038
math_algebra 1 acc 0.0168 ± 0.0037

Reading Comprehension

RACE results are much lower than in the paper.

Task Version Metric Value Stderr
coqa 1 f1 0.7521 ± 0.0153
em 0.6267 ± 0.0188
drop 1 em 0.0359 ± 0.0019
f1 0.1135 ± 0.0023
race 1 acc 0.3990 ± 0.0152

Question Answering

Accuracy is 0 for TriviaQA and webqs.

Task Version Metric Value Stderr
webqs 0 acc 0.0000 ± 0.0000
truthfulqa_mc 1 mc1 0.2105 ± 0.0143
mc2 0.3414 ± 0.0131
headqa_en 0 acc 0.3242 ± 0.0089
acc_norm 0.3592 ± 0.0092
triviaqa 1 acc 0.0000 ± 0.0000
headqa_es 0 acc 0.2826 ± 0.0086
acc_norm 0.3242 ± 0.0089
logiqa 0 acc 0.2181 ± 0.0162
acc_norm 0.3026 ± 0.0180
squad2 1 exact 9.4163
f1 19.4490
HasAns_exact 18.4885
HasAns_f1 38.5827
NoAns_exact 0.3701
NoAns_f1 0.3701
best_exact 50.0716
best_f1 50.0801

LAMBADA

LAMBADA does not work properly: 0 accuracy is obtained and perplexity is in the millions.

Task Version Metric Value Stderr
lambada_openai_mt_it 0 ppl 3653680.5734 ± 197082.9861
acc 0.0000 ± 0.0000
lambada_standard 0 ppl 2460346.8573 ± 81216.5655
acc 0.0000 ± 0.0000
lambada_openai_mt_es 0 ppl 3818890.4545 ± 197999.0532
acc 0.0000 ± 0.0000
lambada_openai 0 ppl 2817465.0925 ± 138319.0882
acc 0.0000 ± 0.0000
lambada_openai_mt_fr 0 ppl 2111186.1155 ± 111724.4284
acc 0.0000 ± 0.0000
lambada_openai_mt_de 0 ppl 1805613.6771 ± 97892.7891
acc 0.0000 ± 0.0000
lambada_standard_cloze 0 ppl 6710057.2411 ± 169833.9100
acc 0.0000 ± 0.0000
lambada_openai_mt_en 0 ppl 2817465.0925 ± 138319.0882
acc 0.0000 ± 0.0000
lambada_openai_cloze 0 ppl 255777.7112 ± 11345.7710
acc 0.0004 ± 0.0003

Arithmetic

Another set of tasks that returns 0 accuracy.

Task Version Metric Value Stderr
arithmetic_3ds 0 acc 0 ± 0
arithmetic_1dc 0 acc 0 ± 0
arithmetic_2da 0 acc 0 ± 0
arithmetic_4ds 0 acc 0 ± 0
arithmetic_3da 0 acc 0 ± 0
arithmetic_2ds 0 acc 0 ± 0
arithmetic_4da 0 acc 0 ± 0
arithmetic_5ds 0 acc 0 ± 0
arithmetic_2dm 0 acc 0 ± 0
arithmetic_5da 0 acc 0 ± 0

BLIMP

Task Version Metric Value Stderr
blimp_npi_present_2 0 acc 0.530 ± 0.0158
blimp_anaphor_gender_agreement 0 acc 0.448 ± 0.0157
blimp_causative 0 acc 0.508 ± 0.0158
blimp_existential_there_quantifiers_1 0 acc 0.683 ± 0.0147
blimp_existential_there_quantifiers_2 0 acc 0.674 ± 0.0148
blimp_existential_there_subject_raising 0 acc 0.696 ± 0.0146
blimp_principle_A_reconstruction 0 acc 0.673 ± 0.0148
blimp_principle_A_domain_3 0 acc 0.501 ± 0.0158
blimp_sentential_subject_island 0 acc 0.606 ± 0.0155
blimp_superlative_quantifiers_2 0 acc 0.561 ± 0.0157
blimp_complex_NP_island 0 acc 0.416 ± 0.0156
blimp_wh_island 0 acc 0.275 ± 0.0141
blimp_wh_vs_that_no_gap_long_distance 0 acc 0.812 ± 0.0124
blimp_principle_A_c_command 0 acc 0.390 ± 0.0154
blimp_sentential_negation_npi_scope 0 acc 0.588 ± 0.0156
blimp_principle_A_case_2 0 acc 0.554 ± 0.0157
blimp_determiner_noun_agreement_2 0 acc 0.598 ± 0.0155
blimp_left_branch_island_echo_question 0 acc 0.835 ± 0.0117
blimp_wh_vs_that_with_gap_long_distance 0 acc 0.227 ± 0.0133
blimp_determiner_noun_agreement_with_adjective_1 0 acc 0.577 ± 0.0156
blimp_ellipsis_n_bar_1 0 acc 0.668 ± 0.0149
blimp_wh_questions_subject_gap 0 acc 0.720 ± 0.0142
blimp_wh_questions_subject_gap_long_distance 0 acc 0.746 ± 0.0138
blimp_only_npi_scope 0 acc 0.266 ± 0.0140
blimp_coordinate_structure_constraint_complex_left_branch 0 acc 0.682 ± 0.0147
blimp_adjunct_island 0 acc 0.539 ± 0.0158
blimp_determiner_noun_agreement_irregular_1 0 acc 0.572 ± 0.0157
blimp_expletive_it_object_raising 0 acc 0.659 ± 0.0150
blimp_npi_present_1 0 acc 0.534 ± 0.0158
blimp_superlative_quantifiers_1 0 acc 0.612 ± 0.0154
blimp_determiner_noun_agreement_with_adj_2 0 acc 0.540 ± 0.0158
blimp_principle_A_domain_2 0 acc 0.646 ± 0.0151
blimp_irregular_past_participle_adjectives 0 acc 0.429 ± 0.0157
blimp_regular_plural_subject_verb_agreement_1 0 acc 0.645 ± 0.0151
blimp_transitive 0 acc 0.698 ± 0.0145
blimp_existential_there_object_raising 0 acc 0.788 ± 0.0129
blimp_distractor_agreement_relational_noun 0 acc 0.441 ± 0.0157
blimp_animate_subject_passive 0 acc 0.626 ± 0.0153
blimp_sentential_negation_npi_licensor_present 0 acc 0.940 ± 0.0075
blimp_only_npi_licensor_present 0 acc 0.814 ± 0.0123
blimp_irregular_plural_subject_verb_agreement_2 0 acc 0.700 ± 0.0145
blimp_matrix_question_npi_licensor_present 0 acc 0.117 ± 0.0102
blimp_passive_2 0 acc 0.703 ± 0.0145
blimp_tough_vs_raising_2 0 acc 0.768 ± 0.0134
blimp_determiner_noun_agreement_with_adj_irregular_1 0 acc 0.563 ± 0.0157
blimp_drop_argument 0 acc 0.701 ± 0.0145
blimp_wh_vs_that_no_gap 0 acc 0.848 ± 0.0114
blimp_wh_vs_that_with_gap 0 acc 0.239 ± 0.0135
blimp_left_branch_island_simple_question 0 acc 0.740 ± 0.0139
blimp_wh_questions_object_gap 0 acc 0.670 ± 0.0149
blimp_determiner_noun_agreement_1 0 acc 0.636 ± 0.0152
blimp_determiner_noun_agreement_with_adj_irregular_2 0 acc 0.591 ± 0.0156
blimp_tough_vs_raising_1 0 acc 0.298 ± 0.0145
blimp_inchoative 0 acc 0.420 ± 0.0156
blimp_principle_A_case_1 0 acc 0.985 ± 0.0038
blimp_animate_subject_trans 0 acc 0.761 ± 0.0135
blimp_intransitive 0 acc 0.592 ± 0.0155
blimp_anaphor_number_agreement 0 acc 0.659 ± 0.0150
blimp_distractor_agreement_relative_clause 0 acc 0.314 ± 0.0147
blimp_regular_plural_subject_verb_agreement_2 0 acc 0.705 ± 0.0144
blimp_ellipsis_n_bar_2 0 acc 0.794 ± 0.0128
blimp_irregular_plural_subject_verb_agreement_1 0 acc 0.653 ± 0.0151
blimp_principle_A_domain_1 0 acc 0.962 ± 0.0060
blimp_determiner_noun_agreement_irregular_2 0 acc 0.602 ± 0.0155
blimp_coordinate_structure_constraint_object_extraction 0 acc 0.629 ± 0.0153
blimp_passive_1 0 acc 0.702 ± 0.0145
blimp_irregular_past_participle_verbs 0 acc 0.725 ± 0.0141

Human Alignment

ETHICS, ToxiGen and CrowS-Pairs.

Task Version Metric Value Stderr
ethics_virtue 0 acc 0.2098 ± 0.0058
em 0.0000
crows_pairs_french_race_color 0 likelihood_difference 12.0489 ± 0.7332
pct_stereotype 0.4326 ± 0.0231
ethics_utilitarianism_original 0 acc 0.9586 ± 0.0029
crows_pairs_english_nationality 0 likelihood_difference 6.7626 ± 0.5869
pct_stereotype 0.5370 ± 0.0340
crows_pairs_english_socioeconomic 0 likelihood_difference 6.4016 ± 0.5420
pct_stereotype 0.5684 ± 0.0360
crows_pairs_french_socioeconomic 0 likelihood_difference 9.8084 ± 1.0151
pct_stereotype 0.5204 ± 0.0358
crows_pairs_english_religion 0 likelihood_difference 7.2196 ± 0.7592
pct_stereotype 0.6667 ± 0.0449
ethics_justice 0 acc 0.4996 ± 0.0096
em 0.0015
crows_pairs_english_autre 0 likelihood_difference 11.0114 ± 5.8908
pct_stereotype 0.4545 ± 0.1575
toxigen 0 acc 0.4309 ± 0.0162
acc_norm 0.4319 ± 0.0162
crows_pairs_french_autre 0 likelihood_difference 7.5120 ± 2.0958
pct_stereotype 0.6154 ± 0.1404
ethics_cm 0 acc 0.5691 ± 0.0079
crows_pairs_english_gender 0 likelihood_difference 7.9174 ± 0.5502
pct_stereotype 0.5312 ± 0.0279
crows_pairs_english_race_color 0 likelihood_difference 6.2465 ± 0.3239
pct_stereotype 0.4665 ± 0.0222
crows_pairs_english_age 0 likelihood_difference 5.9423 ± 0.7903
pct_stereotype 0.5165 ± 0.0527
ethics_utilitarianism 0 acc 0.4981 ± 0.0072
crows_pairs_english_sexual_orientation 0 likelihood_difference 8.3048 ± 0.8428
pct_stereotype 0.6237 ± 0.0505
ethics_deontology 0 acc 0.5058 ± 0.0083
em 0.0022
crows_pairs_french_religion 0 likelihood_difference 9.5853 ± 0.8750
pct_stereotype 0.4348 ± 0.0464
crows_pairs_french_gender 0 likelihood_difference 11.7990 ± 0.8714
pct_stereotype 0.5202 ± 0.0279
crows_pairs_french_nationality 0 likelihood_difference 10.4165 ± 0.9066
pct_stereotype 0.4071 ± 0.0309
crows_pairs_english_physical_appearance 0 likelihood_difference 4.5126 ± 0.6932
pct_stereotype 0.5000 ± 0.0593
crows_pairs_french_age 0 likelihood_difference 11.9396 ± 1.5377
pct_stereotype 0.3556 ± 0.0507
crows_pairs_english_disability 0 likelihood_difference 9.6697 ± 1.1386
pct_stereotype 0.6615 ± 0.0591
crows_pairs_french_sexual_orientation 0 likelihood_difference 7.6058 ± 0.7939
pct_stereotype 0.6703 ± 0.0496
crows_pairs_french_physical_appearance 0 likelihood_difference 7.0451 ± 0.9484
pct_stereotype 0.5556 ± 0.0590
crows_pairs_french_disability 0 likelihood_difference 10.1477 ± 1.3907
pct_stereotype 0.4242 ± 0.0613

MMLU

MMLU results seem to be ok.

Task Version Metric Value Stderr
hendrycksTest-high_school_geography 0 acc 0.4293 ± 0.0353
acc_norm 0.3636 ± 0.0343
hendrycksTest-philosophy 0 acc 0.4019 ± 0.0278
acc_norm 0.3537 ± 0.0272
hendrycksTest-world_religions 0 acc 0.6257 ± 0.0371
acc_norm 0.5146 ± 0.0383
hendrycksTest-college_biology 0 acc 0.3194 ± 0.0390
acc_norm 0.2917 ± 0.0380
hendrycksTest-electrical_engineering 0 acc 0.3586 ± 0.0400
acc_norm 0.3241 ± 0.0390
hendrycksTest-global_facts 0 acc 0.3200 ± 0.0469
acc_norm 0.2900 ± 0.0456
hendrycksTest-high_school_government_and_politics 0 acc 0.4819 ± 0.0361
acc_norm 0.3731 ± 0.0349
hendrycksTest-moral_scenarios 0 acc 0.2760 ± 0.0150
acc_norm 0.2726 ± 0.0149
hendrycksTest-econometrics 0 acc 0.2895 ± 0.0427
acc_norm 0.2632 ± 0.0414
hendrycksTest-international_law 0 acc 0.3884 ± 0.0445
acc_norm 0.5785 ± 0.0451
hendrycksTest-us_foreign_policy 0 acc 0.5600 ± 0.0499
acc_norm 0.4500 ± 0.0500
hendrycksTest-high_school_macroeconomics 0 acc 0.3179 ± 0.0236
acc_norm 0.3026 ± 0.0233
hendrycksTest-virology 0 acc 0.3976 ± 0.0381
acc_norm 0.2892 ± 0.0353
hendrycksTest-high_school_mathematics 0 acc 0.2259 ± 0.0255
acc_norm 0.3074 ± 0.0281
hendrycksTest-clinical_knowledge 0 acc 0.3887 ± 0.0300
acc_norm 0.3811 ± 0.0299
hendrycksTest-professional_psychology 0 acc 0.3840 ± 0.0197
acc_norm 0.2990 ± 0.0185
hendrycksTest-formal_logic 0 acc 0.3095 ± 0.0413
acc_norm 0.3492 ± 0.0426
hendrycksTest-management 0 acc 0.4854 ± 0.0495
acc_norm 0.3689 ± 0.0478
hendrycksTest-human_sexuality 0 acc 0.5115 ± 0.0438
acc_norm 0.3664 ± 0.0423
hendrycksTest-high_school_world_history 0 acc 0.3924 ± 0.0318
acc_norm 0.3376 ± 0.0308
hendrycksTest-medical_genetics 0 acc 0.4400 ± 0.0499
acc_norm 0.4000 ± 0.0492
hendrycksTest-computer_security 0 acc 0.3700 ± 0.0485
acc_norm 0.4400 ± 0.0499
hendrycksTest-miscellaneous 0 acc 0.5837 ± 0.0176
acc_norm 0.3895 ± 0.0174
hendrycksTest-public_relations 0 acc 0.3909 ± 0.0467
acc_norm 0.2273 ± 0.0401
hendrycksTest-college_physics 0 acc 0.2353 ± 0.0422
acc_norm 0.3235 ± 0.0466
hendrycksTest-professional_accounting 0 acc 0.3014 ± 0.0274
acc_norm 0.2943 ± 0.0272
hendrycksTest-logical_fallacies 0 acc 0.3804 ± 0.0381
acc_norm 0.3497 ± 0.0375
hendrycksTest-business_ethics 0 acc 0.5300 ± 0.0502
acc_norm 0.4600 ± 0.0501
hendrycksTest-high_school_chemistry 0 acc 0.2512 ± 0.0305
acc_norm 0.2956 ± 0.0321
hendrycksTest-astronomy 0 acc 0.4539 ± 0.0405
acc_norm 0.4605 ± 0.0406
hendrycksTest-high_school_us_history 0 acc 0.4265 ± 0.0347
acc_norm 0.3137 ± 0.0326
hendrycksTest-college_chemistry 0 acc 0.3300 ± 0.0473
acc_norm 0.3000 ± 0.0461
hendrycksTest-abstract_algebra 0 acc 0.2300 ± 0.0423
acc_norm 0.2600 ± 0.0441
hendrycksTest-moral_disputes 0 acc 0.3642 ± 0.0259
acc_norm 0.3324 ± 0.0254
hendrycksTest-college_computer_science 0 acc 0.3300 ± 0.0473
acc_norm 0.2800 ± 0.0451
hendrycksTest-professional_law 0 acc 0.2966 ± 0.0117
acc_norm 0.2855 ± 0.0115
hendrycksTest-college_mathematics 0 acc 0.3200 ± 0.0469
acc_norm 0.3200 ± 0.0469
hendrycksTest-high_school_microeconomics 0 acc 0.3866 ± 0.0316
acc_norm 0.3655 ± 0.0313
hendrycksTest-high_school_european_history 0 acc 0.4061 ± 0.0383
acc_norm 0.3697 ± 0.0377
hendrycksTest-high_school_biology 0 acc 0.3581 ± 0.0273
acc_norm 0.3581 ± 0.0273
hendrycksTest-security_studies 0 acc 0.4082 ± 0.0315
acc_norm 0.3102 ± 0.0296
hendrycksTest-high_school_psychology 0 acc 0.4661 ± 0.0214
acc_norm 0.3083 ± 0.0198
hendrycksTest-conceptual_physics 0 acc 0.3277 ± 0.0307
acc_norm 0.2170 ± 0.0269
hendrycksTest-human_aging 0 acc 0.3722 ± 0.0324
acc_norm 0.2511 ± 0.0291
hendrycksTest-prehistory 0 acc 0.4012 ± 0.0273
acc_norm 0.2778 ± 0.0249
hendrycksTest-sociology 0 acc 0.4776 ± 0.0353
acc_norm 0.4279 ± 0.0350
hendrycksTest-marketing 0 acc 0.6111 ± 0.0319
acc_norm 0.5043 ± 0.0328
hendrycksTest-high_school_computer_science 0 acc 0.4100 ± 0.0494
acc_norm 0.3400 ± 0.0476
hendrycksTest-machine_learning 0 acc 0.3036 ± 0.0436
acc_norm 0.2679 ± 0.0420
hendrycksTest-elementary_mathematics 0 acc 0.3201 ± 0.0240
acc_norm 0.2910 ± 0.0234
hendrycksTest-nutrition 0 acc 0.3954 ± 0.0280
acc_norm 0.4379 ± 0.0284
hendrycksTest-anatomy 0 acc 0.3852 ± 0.0420
acc_norm 0.2815 ± 0.0389
hendrycksTest-jurisprudence 0 acc 0.4352 ± 0.0479
acc_norm 0.5000 ± 0.0483
hendrycksTest-college_medicine 0 acc 0.3757 ± 0.0369
acc_norm 0.3064 ± 0.0351
hendrycksTest-high_school_statistics 0 acc 0.3426 ± 0.0324
acc_norm 0.3426 ± 0.0324
hendrycksTest-high_school_physics 0 acc 0.2053 ± 0.0330
acc_norm 0.2715 ± 0.0363
hendrycksTest-professional_medicine 0 acc 0.3382 ± 0.0287
acc_norm 0.2794 ± 0.0273

@juletx Hi, I have a similar issue. I ran several tasks and got the following results:

[screenshot of per-task results omitted]

Do you have any solutions?

No, I don't have a solution.

Can look into this! For some tasks, this may not be "fixable" in the sense that we don't know exactly what the LLaMA team did to evaluate, but for others like LAMBADA this is very much not expected.

Yes, I agree. We can't expect exactly the same results because the LLaMA prompts are not published. However, tasks where the accuracy is 0 indicate that there might be a problem. LAMBADA is a clear example, but there are more, such as the math tasks and some QA tasks.
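One cheap way to narrow down prompt differences is to print exactly what the harness feeds the model for a failing task and compare it with what the paper appears to use. A small sketch against the harness's task API (the task name is picked just as an example; method names may differ slightly across versions):

# Sketch: inspect the prompt and target the harness builds for one GSM8K document.
from lm_eval import tasks

task = tasks.get_task_dict(["gsm8k"])["gsm8k"]
docs = list(task.test_docs() if task.has_test_docs() else task.validation_docs())
print(task.doc_to_text(docs[0]))    # the zero-shot prompt the model actually sees
print(task.doc_to_target(docs[0]))  # the reference the output is scored against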

One source of inconsistency is special-token handling in the harness. LLaMA models are trained with a BOS token, so you probably want to encode with it to give the model a "fair" shot. See this feature TODO:

if (
    add_special_tokens is not None
    and self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM
):
    # TODO: Support evaluating causal models with special tokens. Currently,
    # this is not possible because the `_loglikelihood_tokens()` method for
    # causal LMs makes a no-special-tokens assumption given that contexts
    # and labels/continuations are tokenized separately without special
    # tokens, concatenated, and then processed as inputs.
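To see concretely what that means at the tokenizer level, here is a small check. The checkpoint id is only an example, and it assumes a transformers version with LLaMA support:

# Sketch: compare LLaMA tokenization with and without special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example checkpoint id
print(tok.bos_token, tok.bos_token_id)  # LLaMA uses "<s>" with id 1

text = "The quick brown fox"
print(tok(text).input_ids)                            # BOS prepended (add_special_tokens=True by default)
print(tok(text, add_special_tokens=False).input_ids)  # no BOS, matching the harness's current assumption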

Another possibility worth keeping in mind is that the LLaMA implementation in HF could be bugged. I’m not sure how well tested it is against the original codebase, but it’s not an official implementation and (for licensing reasons) had to be written without reference to the original implementation.
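A quick way to rule that out is a perplexity sanity check against the HF implementation directly, outside the harness. A rough sketch, using an example checkpoint id; fp16 and device_map="auto" assume a GPU and the accelerate package:

# Sketch: per-token perplexity of the HF LLaMA port on a trivial sentence.
# If the port were badly broken, this number would be huge (like the LAMBADA perplexities above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"  # example checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())  # expect a small value, not millions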

For reference, I ran HellaSwag and PiQA on lit-llama (https://github.com/Lightning-AI/lit-llama) and got:

hellaswag
acc: 0.5644 ± 0.0049
acc_norm: 0.7306 ± 0.0044
piqa
acc: 0.7840 ± 0.0096
acc_norm: 0.7764 ± 0.0097

This is an independent nanoGPT-based reimplementation of LLaMA, so the results are confirmed (slightly higher for lit-llama, but that's within the uncertainty).

Evaluation for lit-llama was run on this fork: https://github.com/Lightning-AI/lm-evaluation-harness.

It was recently pointed out on Twitter that in the allegedly zero-shot examples they "provide a textual description of the task and a test example." I am comfortable assuming that this explains the discrepancy.

Probably related to tokenizer issues, solved by specifying the tokens: #442

@upunaprosk If correcting the tokenizer solves the problem, it seems like this issue should be opened on the HF transformers repo instead of this one. We are loading the model the way we are told to; it’s just that the transformers library doesn’t know how to load the model.

Can you share your evaluation results with this correction?

Closing, because the tokenizer fixes seem to resolve most of the wildly off results. The others, like TriviaQA, also required some minor modifications to the tasks.