llama3 baseline reproduction problem
fmm170 opened this issue · comments
Hello, the results I reproduce for llama3 on GSM8K with this framework are quite different from the result reported in the paper (79.6). Is it because the few-shot prompt content for llama3 has changed?
Hi there!
Could you share more about what you ran and what scores you got?
I ran gsm8k_cot on the Llama3-8B Instruct model.
gsm8k_cot should be the same prompt they use, as stated in https://github.com/meta-llama/llama3/blob/main/eval_details.md#gsm8k . One difference is that you may need to pass --gen_kwargs max_gen_toks=512, since that limit is mentioned in the linked eval_details.md and I believe we default to a maximum of 256 generated tokens.
Could you try the suggested changes (higher maximum tokens to generate)? And this is with the instruct model, right?
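For reference, the suggested run could be assembled like this. This is only a sketch: the `lm_eval` CLI and `--gen_kwargs` flag are from lm-evaluation-harness, and the HF model path is a placeholder you would swap for your local checkpoint.

```python
# Build the suggested lm-eval command: gsm8k_cot on the Instruct model,
# with the generation budget raised to 512 tokens per eval_details.md.
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=meta-llama/Meta-Llama-3-8B-Instruct",
    "--tasks", "gsm8k_cot",
    "--gen_kwargs", "max_gen_toks=512",
]
print(" ".join(cmd))
```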
I also evaluated llama3-8b-instruct and llama3-70b-instruct, using the default generation config from generation_config.json in the llama3 files:
```json
{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
```
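To pass these sampling settings through lm-eval's `--gen_kwargs` flag, the config fields can be flattened into the comma-separated `key=value` string that flag expects. This is a hypothetical helper, not part of the harness:

```python
# Convert the sampling-related fields of generation_config.json into the
# comma-separated string accepted by lm-eval's --gen_kwargs flag.
gen_config = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}
gen_kwargs = ",".join(f"{k}={v}" for k, v in gen_config.items())
print(gen_kwargs)  # do_sample=True,temperature=0.6,top_p=0.9
```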
Here are my results on GSM8K:
| | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot CoT |
|---|---|---|
| 8b-instruct | 76.42 | 79.6 |
| 70b-instruct | 90.35 | 93.0 |
and BBH:
| | BBH 3-shot CoT Exact-Match | Official BBH 3-shot CoT (BASE model) |
|---|---|---|
| 8b-instruct | 63.17 | 61.1 |
| 70b-instruct | 49.35 | 81.3 |
Please note that the performance of 70b-instruct on BBH is very low. Fortunately, I found the cause: the vllm backend version did not handle Llama3's two EOS tokens, and transformers needed the patch to the relevant EOS stopping criteria. After pinning the default vllm version in lm_eval to 0.4.2 and upgrading transformers to 4.40.2, I got reasonable results:
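A minimal sketch of why the two EOS ids matter: Llama3-Instruct generation must stop at either `<|end_of_text|>` (128001) or `<|eot_id|>` (128009). A backend that only honors a single `eos_token_id` lets Instruct outputs run past `<|eot_id|>`, which breaks answer extraction. The function below is an illustration of the required stopping behavior, not code from vllm or transformers:

```python
# Llama3 declares two EOS token ids in its generation_config.json.
LLAMA3_EOS_IDS = {128001, 128009}

def trim_at_eos(token_ids, eos_ids=frozenset(LLAMA3_EOS_IDS)):
    """Truncate a generated sequence at the first EOS token of either kind."""
    for i, tok in enumerate(token_ids):
        if tok in eos_ids:
            return token_ids[:i]
    # No EOS seen: generation hit the length limit instead.
    return token_ids
```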
GSM8K:
| | GSM8K 8-shot Strict-Match | Official GSM8K 8-shot CoT |
|---|---|---|
| 8b-instruct | 75.44 | 79.6 |
| 70b-instruct | 91.05 | 93.0 |
BBH:
| | BBH 3-shot CoT Exact-Match | Official BBH 3-shot CoT (BASE model) |
|---|---|---|
| 8b-instruct | 64.6 | 61.1 |
| 70b-instruct | 83.38 | 81.3 |