EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

Bad results for LLaMA

juletx opened this issue

I have evaluated LLaMA (7B, 13B and 30B) on most of the tasks available in this library, and the results are bad for some tasks. I will give some examples with the 7B model. I haven't checked all the results yet; I'm posting them here so that we can fix the problems that we find. I can share more results and configs if you need more information.

This is the script that I have used for evaluation.

# model_names, tasks_selected, tasks (an associative array mapping group name -> task list)
# and num_fewshot are defined earlier in the job script.
for model_name in "${model_names[@]}"; do
    for group_name in "${tasks_selected[@]}"; do
        srun python3 lm-evaluation-harness/main.py \
            --model hf-causal-experimental \
            --model_args pretrained=$model_name,use_accelerate=True \
            --tasks ${tasks[${group_name}]} \
            --device cuda \
            --output_path results/llama-${model_name:48}_${group_name}_${num_fewshot}-shot.json \
            --batch_size auto \
            --no_cache \
            --num_fewshot ${num_fewshot}
    done
done
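For anyone who prefers to drive this from Python instead of the CLI, a rough equivalent of the call above using the harness's evaluator API is sketched below. The checkpoint id and task list are placeholders, and argument names may vary slightly between harness versions.

# Sketch of a programmatic equivalent of the CLI call above; assumes lm-evaluation-harness
# is installed and that the checkpoint id (used only as an example) is available.
import json
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=huggyllama/llama-7b,use_accelerate=True",  # example checkpoint id
    tasks=["piqa", "arc_easy", "lambada_openai"],  # any task names known to the harness
    num_fewshot=0,
    batch_size="auto",
    device="cuda",
    no_cache=True,
)
print(evaluator.make_table(results))
with open("llama-7b_example_0-shot.json", "w") as f:
    json.dump(results, f, indent=2)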

Common Sense Reasoning

Results are similar to the paper, generally a bit lower. This is expected because of differences in the prompts. Exceptions include ARC and openbookqa, where the results are much lower.

Task Version Metric Value Stderr
piqa 0 acc 0.7818 ± 0.0096
acc_norm 0.7742 ± 0.0098
wsc273 0 acc 0.8095 ± 0.0238
arc_easy 0 acc 0.6738 ± 0.0096
acc_norm 0.5248 ± 0.0102
hellaswag 0 acc 0.5639 ± 0.0049
acc_norm 0.7298 ± 0.0044
winogrande 0 acc 0.6693 ± 0.0132
prost 0 acc 0.2569 ± 0.0032
acc_norm 0.2803 ± 0.0033
swag 0 acc 0.5547 ± 0.0035
acc_norm 0.6687 ± 0.0033
boolq 1 acc 0.7306 ± 0.0078
arc_challenge 0 acc 0.3823 ± 0.0142
acc_norm 0.4138 ± 0.0144
mc_taco 0 em 0.1126
f1 0.4827
copa 0 acc 0.8400 ± 0.0368
openbookqa 0 acc 0.2820 ± 0.0201
acc_norm 0.4240 ± 0.0221

Mathematical Reasoning

Very low accuracies are obtained, 0 in some cases. GSM8K and MATH results are much lower than in the paper.

Task Version Metric Value Stderr
mathqa 0 acc 0.2677 ± 0.0081
acc_norm 0.2787 ± 0.0082
math_asdiv 0 acc 0.0000 ± 0.0000
gsm8k 0 acc 0.0000 ± 0.0000
math_num_theory 1 acc 0.0074 ± 0.0037
math_precalc 1 acc 0.0037 ± 0.0026
drop 1 em 0.0427 ± 0.0021
f1 0.1216 ± 0.0025
math_geometry 1 acc 0.0084 ± 0.0042
math_counting_and_prob 1 acc 0.0169 ± 0.0059
math_intermediate_algebra 1 acc 0.0066 ± 0.0027
math_prealgebra 1 acc 0.0126 ± 0.0038
math_algebra 1 acc 0.0168 ± 0.0037

Reading Comprehension

RACE results are much lower than in the paper.

Task Version Metric Value Stderr
coqa 1 f1 0.7521 ± 0.0153
em 0.6267 ± 0.0188
drop 1 em 0.0359 ± 0.0019
f1 0.1135 ± 0.0023
race 1 acc 0.3990 ± 0.0152

Question Answering

Accuracy is 0 for TriviaQA and webqs.

Task Version Metric Value Stderr
webqs 0 acc 0.0000 ± 0.0000
truthfulqa_mc 1 mc1 0.2105 ± 0.0143
mc2 0.3414 ± 0.0131
headqa_en 0 acc 0.3242 ± 0.0089
acc_norm 0.3592 ± 0.0092
triviaqa 1 acc 0.0000 ± 0.0000
headqa_es 0 acc 0.2826 ± 0.0086
acc_norm 0.3242 ± 0.0089
logiqa 0 acc 0.2181 ± 0.0162
acc_norm 0.3026 ± 0.0180
squad2 1 exact 9.4163
f1 19.4490
HasAns_exact 18.4885
HasAns_f1 38.5827
NoAns_exact 0.3701
NoAns_f1 0.3701
best_exact 50.0716
best_f1 50.0801

LAMBADA

LAMBADA does not work properly: 0 accuracy is obtained and perplexity is in the millions.

Task Version Metric Value Stderr
lambada_openai_mt_it 0 ppl 3653680.5734 ± 197082.9861
acc 0.0000 ± 0.0000
lambada_standard 0 ppl 2460346.8573 ± 81216.5655
acc 0.0000 ± 0.0000
lambada_openai_mt_es 0 ppl 3818890.4545 ± 197999.0532
acc 0.0000 ± 0.0000
lambada_openai 0 ppl 2817465.0925 ± 138319.0882
acc 0.0000 ± 0.0000
lambada_openai_mt_fr 0 ppl 2111186.1155 ± 111724.4284
acc 0.0000 ± 0.0000
lambada_openai_mt_de 0 ppl 1805613.6771 ± 97892.7891
acc 0.0000 ± 0.0000
lambada_standard_cloze 0 ppl 6710057.2411 ± 169833.9100
acc 0.0000 ± 0.0000
lambada_openai_mt_en 0 ppl 2817465.0925 ± 138319.0882
acc 0.0000 ± 0.0000
lambada_openai_cloze 0 ppl 255777.7112 ± 11345.7710
acc 0.0004 ± 0.0003

Arithmetic

Another set of tasks that returns 0 accuracy.

Task Version Metric Value Stderr
arithmetic_3ds 0 acc 0 ± 0
arithmetic_1dc 0 acc 0 ± 0
arithmetic_2da 0 acc 0 ± 0
arithmetic_4ds 0 acc 0 ± 0
arithmetic_3da 0 acc 0 ± 0
arithmetic_2ds 0 acc 0 ± 0
arithmetic_4da 0 acc 0 ± 0
arithmetic_5ds 0 acc 0 ± 0
arithmetic_2dm 0 acc 0 ± 0
arithmetic_5da 0 acc 0 ± 0

BLIMP

Task Version Metric Value Stderr
blimp_npi_present_2 0 acc 0.530 ± 0.0158
blimp_anaphor_gender_agreement 0 acc 0.448 ± 0.0157
blimp_causative 0 acc 0.508 ± 0.0158
blimp_existential_there_quantifiers_1 0 acc 0.683 ± 0.0147
blimp_existential_there_quantifiers_2 0 acc 0.674 ± 0.0148
blimp_existential_there_subject_raising 0 acc 0.696 ± 0.0146
blimp_principle_A_reconstruction 0 acc 0.673 ± 0.0148
blimp_principle_A_domain_3 0 acc 0.501 ± 0.0158
blimp_sentential_subject_island 0 acc 0.606 ± 0.0155
blimp_superlative_quantifiers_2 0 acc 0.561 ± 0.0157
blimp_complex_NP_island 0 acc 0.416 ± 0.0156
blimp_wh_island 0 acc 0.275 ± 0.0141
blimp_wh_vs_that_no_gap_long_distance 0 acc 0.812 ± 0.0124
blimp_principle_A_c_command 0 acc 0.390 ± 0.0154
blimp_sentential_negation_npi_scope 0 acc 0.588 ± 0.0156
blimp_principle_A_case_2 0 acc 0.554 ± 0.0157
blimp_determiner_noun_agreement_2 0 acc 0.598 ± 0.0155
blimp_left_branch_island_echo_question 0 acc 0.835 ± 0.0117
blimp_wh_vs_that_with_gap_long_distance 0 acc 0.227 ± 0.0133
blimp_determiner_noun_agreement_with_adjective_1 0 acc 0.577 ± 0.0156
blimp_ellipsis_n_bar_1 0 acc 0.668 ± 0.0149
blimp_wh_questions_subject_gap 0 acc 0.720 ± 0.0142
blimp_wh_questions_subject_gap_long_distance 0 acc 0.746 ± 0.0138
blimp_only_npi_scope 0 acc 0.266 ± 0.0140
blimp_coordinate_structure_constraint_complex_left_branch 0 acc 0.682 ± 0.0147
blimp_adjunct_island 0 acc 0.539 ± 0.0158
blimp_determiner_noun_agreement_irregular_1 0 acc 0.572 ± 0.0157
blimp_expletive_it_object_raising 0 acc 0.659 ± 0.0150
blimp_npi_present_1 0 acc 0.534 ± 0.0158
blimp_superlative_quantifiers_1 0 acc 0.612 ± 0.0154
blimp_determiner_noun_agreement_with_adj_2 0 acc 0.540 ± 0.0158
blimp_principle_A_domain_2 0 acc 0.646 ± 0.0151
blimp_irregular_past_participle_adjectives 0 acc 0.429 ± 0.0157
blimp_regular_plural_subject_verb_agreement_1 0 acc 0.645 ± 0.0151
blimp_transitive 0 acc 0.698 ± 0.0145
blimp_existential_there_object_raising 0 acc 0.788 ± 0.0129
blimp_distractor_agreement_relational_noun 0 acc 0.441 ± 0.0157
blimp_animate_subject_passive 0 acc 0.626 ± 0.0153
blimp_sentential_negation_npi_licensor_present 0 acc 0.940 ± 0.0075
blimp_only_npi_licensor_present 0 acc 0.814 ± 0.0123
blimp_irregular_plural_subject_verb_agreement_2 0 acc 0.700 ± 0.0145
blimp_matrix_question_npi_licensor_present 0 acc 0.117 ± 0.0102
blimp_passive_2 0 acc 0.703 ± 0.0145
blimp_tough_vs_raising_2 0 acc 0.768 ± 0.0134
blimp_determiner_noun_agreement_with_adj_irregular_1 0 acc 0.563 ± 0.0157
blimp_drop_argument 0 acc 0.701 ± 0.0145
blimp_wh_vs_that_no_gap 0 acc 0.848 ± 0.0114
blimp_wh_vs_that_with_gap 0 acc 0.239 ± 0.0135
blimp_left_branch_island_simple_question 0 acc 0.740 ± 0.0139
blimp_wh_questions_object_gap 0 acc 0.670 ± 0.0149
blimp_determiner_noun_agreement_1 0 acc 0.636 ± 0.0152
blimp_determiner_noun_agreement_with_adj_irregular_2 0 acc 0.591 ± 0.0156
blimp_tough_vs_raising_1 0 acc 0.298 ± 0.0145
blimp_inchoative 0 acc 0.420 ± 0.0156
blimp_principle_A_case_1 0 acc 0.985 ± 0.0038
blimp_animate_subject_trans 0 acc 0.761 ± 0.0135
blimp_intransitive 0 acc 0.592 ± 0.0155
blimp_anaphor_number_agreement 0 acc 0.659 ± 0.0150
blimp_distractor_agreement_relative_clause 0 acc 0.314 ± 0.0147
blimp_regular_plural_subject_verb_agreement_2 0 acc 0.705 ± 0.0144
blimp_ellipsis_n_bar_2 0 acc 0.794 ± 0.0128
blimp_irregular_plural_subject_verb_agreement_1 0 acc 0.653 ± 0.0151
blimp_principle_A_domain_1 0 acc 0.962 ± 0.0060
blimp_determiner_noun_agreement_irregular_2 0 acc 0.602 ± 0.0155
blimp_coordinate_structure_constraint_object_extraction 0 acc 0.629 ± 0.0153
blimp_passive_1 0 acc 0.702 ± 0.0145
blimp_irregular_past_participle_verbs 0 acc 0.725 ± 0.0141

Human Alignment

ETHICS, ToxiGen and CrowS-Pairs.

Task Version Metric Value Stderr
ethics_virtue 0 acc 0.2098 ± 0.0058
em 0.0000
crows_pairs_french_race_color 0 likelihood_difference 12.0489 ± 0.7332
pct_stereotype 0.4326 ± 0.0231
ethics_utilitarianism_original 0 acc 0.9586 ± 0.0029
crows_pairs_english_nationality 0 likelihood_difference 6.7626 ± 0.5869
pct_stereotype 0.5370 ± 0.0340
crows_pairs_english_socioeconomic 0 likelihood_difference 6.4016 ± 0.5420
pct_stereotype 0.5684 ± 0.0360
crows_pairs_french_socioeconomic 0 likelihood_difference 9.8084 ± 1.0151
pct_stereotype 0.5204 ± 0.0358
crows_pairs_english_religion 0 likelihood_difference 7.2196 ± 0.7592
pct_stereotype 0.6667 ± 0.0449
ethics_justice 0 acc 0.4996 ± 0.0096
em 0.0015
crows_pairs_english_autre 0 likelihood_difference 11.0114 ± 5.8908
pct_stereotype 0.4545 ± 0.1575
toxigen 0 acc 0.4309 ± 0.0162
acc_norm 0.4319 ± 0.0162
crows_pairs_french_autre 0 likelihood_difference 7.5120 ± 2.0958
pct_stereotype 0.6154 ± 0.1404
ethics_cm 0 acc 0.5691 ± 0.0079
crows_pairs_english_gender 0 likelihood_difference 7.9174 ± 0.5502
pct_stereotype 0.5312 ± 0.0279
crows_pairs_english_race_color 0 likelihood_difference 6.2465 ± 0.3239
pct_stereotype 0.4665 ± 0.0222
crows_pairs_english_age 0 likelihood_difference 5.9423 ± 0.7903
pct_stereotype 0.5165 ± 0.0527
ethics_utilitarianism 0 acc 0.4981 ± 0.0072
crows_pairs_english_sexual_orientation 0 likelihood_difference 8.3048 ± 0.8428
pct_stereotype 0.6237 ± 0.0505
ethics_deontology 0 acc 0.5058 ± 0.0083
em 0.0022
crows_pairs_french_religion 0 likelihood_difference 9.5853 ± 0.8750
pct_stereotype 0.4348 ± 0.0464
crows_pairs_french_gender 0 likelihood_difference 11.7990 ± 0.8714
pct_stereotype 0.5202 ± 0.0279
crows_pairs_french_nationality 0 likelihood_difference 10.4165 ± 0.9066
pct_stereotype 0.4071 ± 0.0309
crows_pairs_english_physical_appearance 0 likelihood_difference 4.5126 ± 0.6932
pct_stereotype 0.5000 ± 0.0593
crows_pairs_french_age 0 likelihood_difference 11.9396 ± 1.5377
pct_stereotype 0.3556 ± 0.0507
crows_pairs_english_disability 0 likelihood_difference 9.6697 ± 1.1386
pct_stereotype 0.6615 ± 0.0591
crows_pairs_french_sexual_orientation 0 likelihood_difference 7.6058 ± 0.7939
pct_stereotype 0.6703 ± 0.0496
crows_pairs_french_physical_appearance 0 likelihood_difference 7.0451 ± 0.9484
pct_stereotype 0.5556 ± 0.0590
crows_pairs_french_disability 0 likelihood_difference 10.1477 ± 1.3907
pct_stereotype 0.4242 ± 0.0613

MMLU

MMLU results seem to be ok.

Task Version Metric Value Stderr
hendrycksTest-high_school_geography 0 acc 0.4293 ± 0.0353
acc_norm 0.3636 ± 0.0343
hendrycksTest-philosophy 0 acc 0.4019 ± 0.0278
acc_norm 0.3537 ± 0.0272
hendrycksTest-world_religions 0 acc 0.6257 ± 0.0371
acc_norm 0.5146 ± 0.0383
hendrycksTest-college_biology 0 acc 0.3194 ± 0.0390
acc_norm 0.2917 ± 0.0380
hendrycksTest-electrical_engineering 0 acc 0.3586 ± 0.0400
acc_norm 0.3241 ± 0.0390
hendrycksTest-global_facts 0 acc 0.3200 ± 0.0469
acc_norm 0.2900 ± 0.0456
hendrycksTest-high_school_government_and_politics 0 acc 0.4819 ± 0.0361
acc_norm 0.3731 ± 0.0349
hendrycksTest-moral_scenarios 0 acc 0.2760 ± 0.0150
acc_norm 0.2726 ± 0.0149
hendrycksTest-econometrics 0 acc 0.2895 ± 0.0427
acc_norm 0.2632 ± 0.0414
hendrycksTest-international_law 0 acc 0.3884 ± 0.0445
acc_norm 0.5785 ± 0.0451
hendrycksTest-us_foreign_policy 0 acc 0.5600 ± 0.0499
acc_norm 0.4500 ± 0.0500
hendrycksTest-high_school_macroeconomics 0 acc 0.3179 ± 0.0236
acc_norm 0.3026 ± 0.0233
hendrycksTest-virology 0 acc 0.3976 ± 0.0381
acc_norm 0.2892 ± 0.0353
hendrycksTest-high_school_mathematics 0 acc 0.2259 ± 0.0255
acc_norm 0.3074 ± 0.0281
hendrycksTest-clinical_knowledge 0 acc 0.3887 ± 0.0300
acc_norm 0.3811 ± 0.0299
hendrycksTest-professional_psychology 0 acc 0.3840 ± 0.0197
acc_norm 0.2990 ± 0.0185
hendrycksTest-formal_logic 0 acc 0.3095 ± 0.0413
acc_norm 0.3492 ± 0.0426
hendrycksTest-management 0 acc 0.4854 ± 0.0495
acc_norm 0.3689 ± 0.0478
hendrycksTest-human_sexuality 0 acc 0.5115 ± 0.0438
acc_norm 0.3664 ± 0.0423
hendrycksTest-high_school_world_history 0 acc 0.3924 ± 0.0318
acc_norm 0.3376 ± 0.0308
hendrycksTest-medical_genetics 0 acc 0.4400 ± 0.0499
acc_norm 0.4000 ± 0.0492
hendrycksTest-computer_security 0 acc 0.3700 ± 0.0485
acc_norm 0.4400 ± 0.0499
hendrycksTest-miscellaneous 0 acc 0.5837 ± 0.0176
acc_norm 0.3895 ± 0.0174
hendrycksTest-public_relations 0 acc 0.3909 ± 0.0467
acc_norm 0.2273 ± 0.0401
hendrycksTest-college_physics 0 acc 0.2353 ± 0.0422
acc_norm 0.3235 ± 0.0466
hendrycksTest-professional_accounting 0 acc 0.3014 ± 0.0274
acc_norm 0.2943 ± 0.0272
hendrycksTest-logical_fallacies 0 acc 0.3804 ± 0.0381
acc_norm 0.3497 ± 0.0375
hendrycksTest-business_ethics 0 acc 0.5300 ± 0.0502
acc_norm 0.4600 ± 0.0501
hendrycksTest-high_school_chemistry 0 acc 0.2512 ± 0.0305
acc_norm 0.2956 ± 0.0321
hendrycksTest-astronomy 0 acc 0.4539 ± 0.0405
acc_norm 0.4605 ± 0.0406
hendrycksTest-high_school_us_history 0 acc 0.4265 ± 0.0347
acc_norm 0.3137 ± 0.0326
hendrycksTest-college_chemistry 0 acc 0.3300 ± 0.0473
acc_norm 0.3000 ± 0.0461
hendrycksTest-abstract_algebra 0 acc 0.2300 ± 0.0423
acc_norm 0.2600 ± 0.0441
hendrycksTest-moral_disputes 0 acc 0.3642 ± 0.0259
acc_norm 0.3324 ± 0.0254
hendrycksTest-college_computer_science 0 acc 0.3300 ± 0.0473
acc_norm 0.2800 ± 0.0451
hendrycksTest-professional_law 0 acc 0.2966 ± 0.0117
acc_norm 0.2855 ± 0.0115
hendrycksTest-college_mathematics 0 acc 0.3200 ± 0.0469
acc_norm 0.3200 ± 0.0469
hendrycksTest-high_school_microeconomics 0 acc 0.3866 ± 0.0316
acc_norm 0.3655 ± 0.0313
hendrycksTest-high_school_european_history 0 acc 0.4061 ± 0.0383
acc_norm 0.3697 ± 0.0377
hendrycksTest-high_school_biology 0 acc 0.3581 ± 0.0273
acc_norm 0.3581 ± 0.0273
hendrycksTest-security_studies 0 acc 0.4082 ± 0.0315
acc_norm 0.3102 ± 0.0296
hendrycksTest-high_school_psychology 0 acc 0.4661 ± 0.0214
acc_norm 0.3083 ± 0.0198
hendrycksTest-conceptual_physics 0 acc 0.3277 ± 0.0307
acc_norm 0.2170 ± 0.0269
hendrycksTest-human_aging 0 acc 0.3722 ± 0.0324
acc_norm 0.2511 ± 0.0291
hendrycksTest-prehistory 0 acc 0.4012 ± 0.0273
acc_norm 0.2778 ± 0.0249
hendrycksTest-sociology 0 acc 0.4776 ± 0.0353
acc_norm 0.4279 ± 0.0350
hendrycksTest-marketing 0 acc 0.6111 ± 0.0319
acc_norm 0.5043 ± 0.0328
hendrycksTest-high_school_computer_science 0 acc 0.4100 ± 0.0494
acc_norm 0.3400 ± 0.0476
hendrycksTest-machine_learning 0 acc 0.3036 ± 0.0436
acc_norm 0.2679 ± 0.0420
hendrycksTest-elementary_mathematics 0 acc 0.3201 ± 0.0240
acc_norm 0.2910 ± 0.0234
hendrycksTest-nutrition 0 acc 0.3954 ± 0.0280
acc_norm 0.4379 ± 0.0284
hendrycksTest-anatomy 0 acc 0.3852 ± 0.0420
acc_norm 0.2815 ± 0.0389
hendrycksTest-jurisprudence 0 acc 0.4352 ± 0.0479
acc_norm 0.5000 ± 0.0483
hendrycksTest-college_medicine 0 acc 0.3757 ± 0.0369
acc_norm 0.3064 ± 0.0351
hendrycksTest-high_school_statistics 0 acc 0.3426 ± 0.0324
acc_norm 0.3426 ± 0.0324
hendrycksTest-high_school_physics 0 acc 0.2053 ± 0.0330
acc_norm 0.2715 ± 0.0363
hendrycksTest-professional_medicine 0 acc 0.3382 ± 0.0287
acc_norm 0.2794 ± 0.0273

@juletx Hi, I have a similar issue. I ran several tasks and got the following results:

[screenshot of per-task results omitted]

Do you have any solutions?

No, I don't have a solution.

Can look into this! For some tasks, this may not be "fixable" in the sense that we don't know exactly what the LLaMA team did to evaluate, but for others like LAMBADA this is very much not expected.

Yes, I agree. We can't expect exactly the same results because the LLaMA prompts are not published. However, tasks where the accuracy is 0 indicate that there might be a problem. LAMBADA is a clear example, but there are more, such as the math tasks and some QA tasks.
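One cheap way to narrow down prompt differences is to print exactly what the harness feeds the model for a failing task and compare it with what the paper appears to use. A small sketch against the harness's task API (the task name is picked just as an example; method names may differ slightly across versions):

# Sketch: inspect the prompt and target the harness builds for one GSM8K document.
from lm_eval import tasks

task = tasks.get_task_dict(["gsm8k"])["gsm8k"]
docs = list(task.test_docs() if task.has_test_docs() else task.validation_docs())
print(task.doc_to_text(docs[0]))    # the zero-shot prompt the model actually sees
print(task.doc_to_target(docs[0]))  # the reference the output is scored against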

One source of inconsistency is special-token handling in the harness. LLaMA models are trained with a BOS token, so you probably want to encode with it to give the model a "fair" shot. See this feature TODO:

if (
    add_special_tokens is not None
    and self.AUTO_MODEL_CLASS is transformers.AutoModelForCausalLM
):
    # TODO: Support evaluating causal models with special tokens. Currently,
    # this is not possible because the `_loglikelihood_tokens()` method for
    # causal LMs makes a no-special-tokens assumption given that contexts
    # and labels/continuations are tokenized separately without special
    # tokens, concatenated, and then processed as inputs.
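To see concretely what that means at the tokenizer level, here is a small check. The checkpoint id is only an example, and it assumes a transformers version with LLaMA support:

# Sketch: compare LLaMA tokenization with and without special tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example checkpoint id
print(tok.bos_token, tok.bos_token_id)  # LLaMA uses "<s>" with id 1

text = "The quick brown fox"
print(tok(text).input_ids)                            # BOS prepended (add_special_tokens=True by default)
print(tok(text, add_special_tokens=False).input_ids)  # no BOS, matching the harness's current assumption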

Another possibility worth keeping in mind is that the LLaMA implementation in HF could be bugged. I’m not sure how well tested it is against the original codebase, but it’s not an official implementation and (for licensing reasons) had to be written without reference to the original implementation.
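A quick way to rule that out is a perplexity sanity check against the HF implementation directly, outside the harness. A rough sketch, using an example checkpoint id; fp16 and device_map="auto" assume a GPU and the accelerate package:

# Sketch: per-token perplexity of the HF LLaMA port on a trivial sentence.
# If the port were badly broken, this number would be huge (like the LAMBADA perplexities above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "huggyllama/llama-7b"  # example checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

ids = tok("The capital of France is Paris.", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())  # expect a small value, not millions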

For reference, I ran HellaSwag and PiQA on lit-llama (https://github.com/Lightning-AI/lit-llama) and got:

hellaswag
acc: 0.5644 ± 0.0049
acc_norm: 0.7306 ± 0.0044
piqa
acc: 0.7840 ± 0.0096
acc_norm: 0.7764 ± 0.0097

This is an independent nanoGPT-based reimplementation of LLaMA, so the results are confirmed (slightly higher for lit-llama, but that's within the uncertainty).

Evaluation for lit-llama was run on this fork: https://github.com/Lightning-AI/lm-evaluation-harness.

It was recently pointed out on Twitter that in the allegedly zero-shot examples they "provide a textual description of the task and a test example." I am comfortable assuming that this explains the discrepancy.

Probably related to tokenizer issues, solved by specifying the tokens: #442

@upunaprosk If correcting the tokenizer solves the problem, it seems like this issue should be opened on the HF transformers repo instead of this one. We are loading the model the way we are told to; it’s just that the transformers library doesn’t know how to load the model.

Can you share your evaluation results with this correction?

Closing, because the tokenizer fixes seem to resolve most of the wildly off results. The others, like TriviaQA, also required some minor modifications to the tasks.