SafeAILab / EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)

Home Page: https://arxiv.org/pdf/2406.16858


Quality loss in greedy mode.

w32zhong opened this issue · comments

commented

In my case, with the model set to greedy decoding (the default ea_generate arguments), EAGLE's HumanEval accuracy for LLaMA-7B drops to 5.49 from the baseline's 8.54.

I have not yet looked into EAGLE's code very carefully on this issue; I am just curious whether anyone has encountered a similar problem.

If there is a bug in the code, could it accidentally be improving the measured efficiency?
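For reference, my greedy setup looks roughly like the following (a hedged sketch based on README-style usage; the model paths here are illustrative stand-ins, and method names may differ slightly across versions):

import torch
from eagle.model.ea_model import EaModel

# Illustrative checkpoints, not necessarily the exact ones from my run.
model = EaModel.from_pretrained(
    base_model_path="lmsys/vicuna-7b-v1.3",
    ea_model_path="yuhuili/EAGLE-Vicuna-7B-v1.3",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)
model.eval()

prompt = "Write a Python function that reverses a string."
input_ids = model.tokenizer([prompt], return_tensors="pt").input_ids.cuda()
# temperature=0.0 makes decoding deterministic (greedy) in this setup.
output_ids = model.eagenerate(input_ids, temperature=0.0, max_new_tokens=512)
print(model.tokenizer.decode(output_ids[0]))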

We conducted tests and the result files are as follows.
test.zip

In FP32 precision, the output of EAGLE (test/vc7b_fp32-temperature-0.0.jsonl) is completely consistent with the output of the vanilla baseline (test/vc7_fp32_base-temperature-0.0.jsonl), as verified by running test/compare.py, except for the question with id 92. Examining the corresponding output shows that this inconsistency is caused by different stopping strategies when the maximum length is reached. In FP16 precision, floating-point errors may lead to slight inconsistencies (see Appendix E of Spec-Bench), but this should not result in quality loss.
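For reference, the comparison amounts to roughly the following (a minimal sketch, assuming the MT-Bench-style jsonl layout with a question_id field and generated text under choices[0]["turns"]; test/compare.py may differ in detail):

import json

def load_answers(path):
    # Map question_id -> generated turns (assumed MT-Bench-style jsonl layout).
    answers = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                answers[rec["question_id"]] = rec["choices"][0]["turns"]
    return answers

eagle = load_answers("test/vc7b_fp32-temperature-0.0.jsonl")
vanilla = load_answers("test/vc7_fp32_base-temperature-0.0.jsonl")

for qid in sorted(eagle.keys() & vanilla.keys()):
    if eagle[qid] != vanilla[qid]:
        print(f"Mismatch at question id {qid}")  # only id 92 differs in FP32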

Your issue may be due to the following reasons:

  1. Different stopping strategies (e.g., different maximum lengths)

  2. Failure to truncate the output correctly (see L240-L262 of EAGLE/eagle/evaluation/gen_ea_answer_vicuna.py; a sketch of this truncation step follows below)
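The truncation in point 2 has roughly this shape (an illustrative sketch, not the repo's exact code; the stop string and special tokens here are assumptions):

def truncate_output(output, stop_str="</s>", special_tokens=("<s>", "</s>")):
    # Cut at the first stop string so anything generated past EOS is dropped.
    idx = output.find(stop_str)
    if idx >= 0:
        output = output[:idx]
    # Remove leftover special tokens and trim surrounding whitespace.
    for tok in special_tokens:
        output = output.replace(tok, "")
    return output.strip()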

commented

Thanks for your prompt response. I use the same maximum length of 1900. However, my baseline is evaluated in a different framework, since I need to compare several systems; the target models are the same checkpoint, though.

Are these commands below used to generate the outputs in test.zip?

python -m eagle.evaluation.gen_ea_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

python -m eagle.evaluation.gen_baseline_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

Generating the files in test.zip requires two additional steps.

First, pull the latest code; a recent update roughly unified the maximum generation length. Second, change torch_dtype=torch.float16 to torch_dtype=torch.float32 at line 188.
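Concretely, the loading call changes roughly as follows (a sketch of the edit; the surrounding arguments follow the repo's usual from_pretrained call and use the script's existing base_model_path/ea_model_path variables, and the exact line number may drift across revisions):

# In eagle/evaluation/gen_ea_answer_vicuna.py, around the referenced line 188:
model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=ea_model_path,
    torch_dtype=torch.float32,  # was torch.float16; FP32 removes rounding drift
    low_cpu_mem_usage=True,
    device_map="auto",
)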

commented

@Liyuhui-12 Thank you so much; I will give it a shot.

commented

@w32zhong Quick question: is the accuracy back to normal now?

commented

> @w32zhong Quick question: is the accuracy back to normal now?

I couldn't replicate the exact baseline effectiveness scores, but they are very close except for HumanEval.