AkariAsai / self-rag

This repository includes the original implementation of Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Home Page: https://selfrag.github.io/


Cannot reproduce baseline tasks?

AllenShow opened this issue

Hi! Thanks for your great work.

I tried to reproduce the baseline tasks, but my results were much lower than those reported in the paper, so I am not sure whether I used the correct scripts. Please help me.

For PopQA

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/popqa_longtail_w_gs.jsonl  \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_popqa_run_short_form_Llama-2-7b-hf_100 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.09578270192994996, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/popqa_longtail_w_gs.jsonl \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_popqa_run_short_form_Llama-2-7b-hf_100_Retrieval-augmented --task qa --mode retrieval --prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.3566833452466047, which is close to the paper. So this result may be correct.
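(For reference, my understanding is that the match metric for these short-form QA tasks checks whether any gold answer appears as a substring of the lightly normalized prediction. The sketch below is my own approximation, not the exact code in run_baseline_lm.py, but it shows why noisy generations can still score whenever the answer string appears somewhere.)

import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def match(prediction: str, gold_answers: list[str]) -> int:
    """1 if any normalized gold answer is a substring of the normalized prediction."""
    pred = normalize(prediction)
    return int(any(normalize(ans) in pred for ans in gold_answers))

# A rambling generation still counts as correct if the answer string shows up.
print(match("I think the answer is Paris, France.\n\nQuestion:", ["Paris"]))  # 1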

For ARC Challenge

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/arc_challenge_processed.jsonl \
 --max_new_tokens 50 --metric match \
--result_fp output/test_out_arc_run_short_form_Llama-2-7b-hf_50 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.11433447098976109, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/arc_challenge_processed.jsonl \
 --max_new_tokens 50 --metric match \
--result_fp output/test_out_arc_run_short_form_Llama-2-7b-hf_50_Retrieval-augmented --task qa --mode retrieval --prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.09044368600682594, which is low.

For PubHealth

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/health_claims_processed.jsonl \
--max_new_tokens 50 --metric match \
--result_fp output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_50 --task qa --prompt_name "prompt_no_input" --world_size 8

overall result: 0.0060790273556231, which is low.

python run_baseline_lm.py \
--model_name meta-llama/Llama-2-7b-hf \
--input_file eval_data/health_claims_processed.jsonl \
 --max_new_tokens 100 --metric match \
--result_fp output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_Retrieval-augmented --task qa \
--mode retrieval \
--prompt_name "prompt_no_input_retrieval" --world_size 8

overall result: 0.008105369807497468, which is low.
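One way to see what is going wrong is to inspect the raw generations saved to --result_fp. I am assuming the file is JSON with parallel lists of predictions and gold answers; the key names below are my guess, so check the actual structure first.

import json

# Hypothetical keys ("preds", "golds"); adjust to whatever run_baseline_lm.py actually writes.
with open("output/test_out_pubhealth_run_short_form_Llama-2-7b-hf_50") as f:
    results = json.load(f)

for pred, gold in list(zip(results.get("preds", []), results.get("golds", [])))[:10]:
    print(repr(pred), "<->", repr(gold))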

Any updates on this?

@AllenShow I got similar results to yours. I also don't know why the scores are so low, even with the relatively imprecise evaluation metric (match). I observed that the outputs are often not the exact answers and contain many unrelated tokens, and that both the temperature and batch_size influence the final results.
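If sampling noise is part of the problem, greedy decoding should at least make runs comparable. Assuming the script generates with vLLM (the --world_size flag looks like tensor parallelism), this is the kind of setting I mean; it is not a claim about what run_baseline_lm.py actually does.

from vllm import LLM, SamplingParams

# Greedy decoding: temperature 0 removes run-to-run sampling variance.
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=8)
params = SamplingParams(temperature=0.0, top_p=1.0, max_tokens=100)
outputs = llm.generate(["### Instruction:\nWho wrote Hamlet?\n\n### Response:\n"], params)
print(outputs[0].outputs[0].text)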

Could you please share some more details? Thanks. @AkariAsai

I think you should specify the "task" argument accordingly.

python run_baseline_lm.py --model_name meta-llama/Llama-2-7b-hf --input_file ./eval_data/arc_challenge_processed.jsonl --max_new_tokens 20 --metric match --result_fp ./eval_results/llama2_7b_arcc_results.json --task arc_c
overall results: 0.28498293515358364
This result is higher than the one reported in the paper.
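My guess is that --task arc_c switches to a multiple-choice prompt and maps the generation onto one of the choice labels before comparing with answerKey, which would explain the jump. A hypothetical sketch of that kind of post-processing (not the repo's code):

def postprocess_arc(prediction: str, choices: dict[str, str]) -> str:
    """Map a free-form generation to an ARC choice label such as "A"/"B"/"C"/"D".

    choices maps labels to choice texts, e.g. {"A": "gravity", "B": "friction"}.
    """
    pred = prediction.strip()
    # Case 1: the generation starts with a bare label like "B" or "B)".
    first = pred.split()[0].strip(".):").upper() if pred.split() else ""
    if first in choices:
        return first
    # Case 2: the generation contains the text of one of the choices.
    for label, text in choices.items():
        if text.lower() in pred.lower():
            return label
    return ""  # no recognizable answer

print(postprocess_arc("B) friction slows the object down", {"A": "gravity", "B": "friction"}))  # B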

python run_baseline_lm.py --model_name meta-llama/Llama-2-7b-hf --input_file ./eval_data/health_claims_processed.jsonl --max_new_tokens 20 --metric match --result_fp ./eval_results/llama2_7b_pubhealth_results.json --task fever
overall results: 0.1702127659574468
This result is significantly lower than the one reported in the paper.

It seems to me that the generated content is noisy and there is not enough post-processing. For example, pred="True\n" is marked wrong when label="true". I am not sure if that is the issue. Please advise on how to reproduce the baseline results, and correct me if I'm wrong. @AkariAsai
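If it really is a strict string comparison, normalizing both sides before matching would rule that out. A minimal sketch of what I mean (my own code, not the repo's):

def normalize_label(text: str) -> str:
    """Strip whitespace/newlines and lowercase, so "True\n" matches "true"."""
    return text.strip().lower()

pred, label = "True\n", "true"
print(pred == label)                                    # False: strict comparison fails
print(normalize_label(pred) == normalize_label(label))  # True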