SafeAILab / EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

Home Page: https://arxiv.org/pdf/2406.16858


Quality loss in greedy mode.

w32zhong opened this issue · comments

commented

In my case, with the model set to greedy decoding (the default ea_generate arguments), EAGLE's HumanEval accuracy for LLaMA-7B drops to 5.49 from the baseline's 8.54.

I have not yet looked into EAGLE's code very carefully on this issue; I am just curious whether anyone has encountered a similar problem.

If there is a bug in the code, could it accidentally be improving the measured efficiency?
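For reference, my greedy setup looks roughly like the following (a hedged sketch based on README-style usage; the model paths here are illustrative stand-ins, and method names may differ slightly across versions):

import torch
from eagle.model.ea_model import EaModel

# Illustrative checkpoints, not necessarily the exact ones from my run.
model = EaModel.from_pretrained(
    base_model_path="lmsys/vicuna-7b-v1.3",
    ea_model_path="yuhuili/EAGLE-Vicuna-7B-v1.3",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "Write a Python function that reverses a string."
input_ids = model.tokenizer([prompt], return_tensors="pt").input_ids.cuda()
# temperature=0.0 makes decoding deterministic (greedy) in this setup.
output_ids = model.eagenerate(input_ids, temperature=0.0, max_new_tokens=512)
print(model.tokenizer.decode(output_ids[0]))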

We conducted tests and the result files are as follows.
test.zip

In FP32 precision, the output of EAGLE (test/vc7b_fp32-temperature-0.0.jsonl) is completely consistent with the output of the vanilla baseline (test/vc7_fp32_base-temperature-0.0.jsonl), as verified by running test/compare.py, except for the question with id 92. Examining the corresponding output shows that this inconsistency is caused by different stopping strategies when the maximum length is reached. In FP16 precision, floating-point errors may lead to slight inconsistencies (see Appendix E of Spec-Bench), but this should not result in quality loss.
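For reference, the comparison amounts to roughly the following (a minimal sketch, assuming the MT-Bench-style jsonl layout with a question_id field and generated text under choices[0]["turns"]; test/compare.py may differ in detail):

import json

def load_answers(path):
    # Map question_id -> generated turns (assumed MT-Bench-style jsonl layout).
    answers = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                answers[rec["question_id"]] = rec["choices"][0]["turns"]
    return answers

eagle = load_answers("test/vc7b_fp32-temperature-0.0.jsonl")
vanilla = load_answers("test/vc7_fp32_base-temperature-0.0.jsonl")

for qid in sorted(eagle.keys() & vanilla.keys()):
    if eagle[qid] != vanilla[qid]:
        print(f"Mismatch at question id {qid}")  # only id 92 differs in FP32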

Your issue may be due to the following reasons:

  1. Different stopping strategies (e.g., different maximum lengths)

  2. Failure to truncate the output correctly (see L240-L262 of EAGLE/eagle/evaluation/gen_ea_answer_vicuna.py; a sketch of this truncation step follows below)
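The truncation in point 2 has roughly this shape (an illustrative sketch, not the repo's exact code; the stop string and special tokens here are assumptions):

def truncate_output(output, stop_str="</s>", special_tokens=("<s>", "</s>")):
    # Cut at the first stop string so anything generated past EOS is dropped.
    idx = output.find(stop_str)
    if idx >= 0:
        output = output[:idx]
    # Remove leftover special tokens and trim surrounding whitespace.
    for tok in special_tokens:
        output = output.replace(tok, "")
    return output.strip()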

commented

Thanks for your prompt response. I use the same maximum length of 1900. However, my baseline is evaluated in a different framework, since I need to compare several systems; the target models are the same checkpoint, though.

Are these commands below used to generate the outputs in test.zip?

python -m eagle.evaluation.gen_ea_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

python -m eagle.evaluation.gen_baseline_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

Generating the files in test.zip requires two additional steps.

First, pull the latest code; a recent update roughly unified the maximum generation length. Second, change torch_dtype=torch.float16 to torch_dtype=torch.float32 at line 188.
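Concretely, the loading call changes roughly as follows (a sketch of the edit; the surrounding arguments follow the repo's usual from_pretrained call and use the script's existing base_model_path/ea_model_path variables, and the exact line number may drift across revisions):

# In eagle/evaluation/gen_ea_answer_vicuna.py, around the referenced line 188:
model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=ea_model_path,
    torch_dtype=torch.float32,  # was torch.float16; FP32 removes rounding drift
    low_cpu_mem_usage=True,
    device_map="auto",
)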

commented

@Liyuhui-12 Thank you so much; I will give it a shot.

commented

@w32zhong Quick question: is the accuracy back to normal now?

commented

> @w32zhong Quick question: is the accuracy back to normal now?

I couldn't replicate the exact baseline effectiveness scores, but they are very close except for HumanEval.