AkariAsai / self-rag

This repository includes the original implementation of Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Home Page: https://selfrag.github.io/


Cannot reproduce baseline tasks?

AllenShow opened this issue

Hi! Thanks for your great work.

I tried to reproduce the baseline tasks, but my results were much lower than those reported in the paper, so I am not sure whether I used the correct scripts. Please help me.

For PopQA

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/popqa_longtail_w_gs.jsonl  \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_popqa_run_short_form_Llama-2-7b-hf_100 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.09578270192994996, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/popqa_longtail_w_gs.jsonl \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_popqa_run_short_form_Llama-2-7b-hf_100_Retrieval-augmented --task qa --mode retrieval --prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.3566833452466047, which is close to the paper. So this result may be correct.
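(For reference, my understanding is that the match metric for these short-form QA tasks checks whether any gold answer appears as a substring of the lightly normalized prediction. The sketch below is my own approximation, not the exact code in run_baseline_lm.py, but it shows why noisy generations can still score whenever the answer string appears somewhere.)

import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def match(prediction: str, gold_answers: list[str]) -> int:
    """1 if any normalized gold answer is a substring of the normalized prediction."""
    pred = normalize(prediction)
    return int(any(normalize(ans) in pred for ans in gold_answers))

# A rambling generation still counts as correct if the answer string shows up.
print(match("I think the answer is Paris, France.\n\nQuestion:", ["Paris"]))  # 1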

For ARC Challenge

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/arc_challenge_processed.jsonl \
 --max_new_tokens 50 --metric match \
--result_fp output/test_out_arc_run_short_form_Llama-2-7b-hf_50 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.11433447098976109, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/arc_challenge_processed.jsonl \
 --max_new_tokens 50 --metric match \
--result_fp output/test_out_arc_run_short_form_Llama-2-7b-hf_50_Retrieval-augmented --task qa --mode retrieval --prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.09044368600682594, which is low.

For PubHealth

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/health_claims_processed.jsonl \
--max_new_tokens 50 --metric match \
--result_fp output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_50 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.0060790273556231, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/health_claims_processed.jsonl \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_Retrieval-augmented --task qa \
--mode retrieval \
--prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.008105369807497468, which is low.
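One way to see what is going wrong is to inspect the raw generations saved to --result_fp. I am assuming the file is JSON with parallel lists of predictions and gold answers; the key names below are my guess, so check the actual structure first.

import json

# Hypothetical keys ("preds", "golds"); adjust to whatever run_baseline_lm.py actually writes.
with open("output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_50") as f:
    results = json.load(f)

for pred, gold in list(zip(results.get("preds", []), results.get("golds", [])))[:10]:
    print(repr(pred), "<->", repr(gold))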

Any updates on this?

@AllenShow I got similar results to yours. I also don't know why the scores are so low, even with the relatively imprecise evaluation metric (match). I observed that the outputs are often not the exact answers and contain many unrelated tokens, and that both the temperature and batch_size influence the final results.
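If sampling noise is part of the problem, greedy decoding should at least make runs comparable. Assuming the script generates with vLLM (the --world_size flag looks like tensor parallelism), this is the kind of setting I mean; it is not a claim about what run_baseline_lm.py actually does.

from vllm import LLM, SamplingParams

# Greedy decoding: temperature 0 removes run-to-run sampling variance.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=100)
outputs = llm.generate(["### Instruction:\nWho wrote Hamlet?\n\n### Response:\n"], params)
print(outputs[0].outputs[0].text)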

Could you please share some more details? Thanks. @AkariAsai

I think you should specify the "task" argument accordingly.

python run_baseline_lm.py --model_name meta-llama/Llama-2-7b-hf --input_file ./eval_data/arc_challenge_processed.jsonl --max_new_tokens 20 --metric match --result_fp ./eval_results/llama2_7b_arcc_results.json --task arc_c
overall results: 0.28498293515358364
This result is higher than the one reported in the paper.
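My guess is that --task arc_c switches to a multiple-choice prompt and maps the generation onto one of the choice labels before comparing with answerKey, which would explain the jump. A hypothetical sketch of that kind of post-processing (not the repo's code):

def postprocess_arc(prediction: str, choices: dict[str, str]) -> str:
    """Map a free-form generation to an ARC choice label such as "A"/"B"/"C"/"D".

    choices maps labels to choice texts, e.g. {"A": "gravity", "B": "friction"}.
    """
    pred = prediction.strip()
    # Case 1: the generation starts with a bare label like "B" or "B)".
    first = pred.split()[0].strip(".):").upper() if pred.split() else ""
    if first in choices:
        return first
    # Case 2: the generation contains the text of one of the choices.
    for label, text in choices.items():
        if text.lower() in pred.lower():
            return label
    return ""  # no recognizable answer

print(postprocess_arc("B) friction slows the object down", {"A": "gravity", "B": "friction"}))  # B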

python run_baseline_lm.py --model_name meta-llama/Llama-2-7b-hf --input_file ./eval_data/health_claims_processed.jsonl --max_new_tokens 20 --metric match --result_fp ./eval_results/llama2_7b_pubhealth_results.json --task fever
overall results: 0.1702127659574468
This result is significantly lower than the one reported in the paper.

It seems to me that the generated content is noisy and there is not enough post-processing. For example, pred="True\n" is marked wrong when label="true". I am not sure if that is the issue. Please advise on how to reproduce the baseline results, and correct me if I'm wrong. @AkariAsai
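If it really is a strict string comparison, normalizing both sides before matching would rule that out. A minimal sketch of what I mean (my own code, not the repo's):

def normalize_label(text: str) -> str:
    """Strip whitespace/newlines and lowercase, so "True\n" matches "true"."""
    return text.strip().lower()

pred, label = "True\n", "true"
print(pred == label)                                    # False: strict comparison fails
print(normalize_label(pred) == normalize_label(label))  # True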