EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

llama3 baseline reproduction problem

fmm170 opened this issue

commented

Hello, the GSM8K results for Llama 3 that I reproduced with this framework are quite different from the results reported in the paper (79.6). Is this because the few-shot content for Llama 3 has changed?

Hi there!

Could you share more about what you ran and what scores you got?

On the Instruct Llama3-8B model, gsm8k_cot should use the same prompt they report in https://github.com/meta-llama/llama3/blob/main/eval_details.md#gsm8k . One difference is that you may need to pass --gen_kwargs max_gen_toks=512, since this is mentioned in the linked eval_details.md and I believe we default to a maximum of 256 generated tokens.
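
For reference, a minimal invocation along those lines might look like the sketch below (the model name and batch size here are illustrative choices, not something specified in this thread):

lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k_cot \
    --gen_kwargs max_gen_toks=512 \
    --batch_size auto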

commented

[screenshot: gsm8k and gsm8k_cot score tables]
Hello! These are the results I got from running Llama 3 on the gsm8k and gsm8k_cot tasks. Thanks for your reply!

Could you try the suggested changes (higher maximum tokens to generate)? And this is with the instruct model, right?

I also tried evaluating llama3-8b-instruct and llama3-70b-instruct. I used the default generation config from generation_config.json in the Llama 3 files:

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
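
If it helps anyone reproducing this: assuming the hf backend, those sampling settings can be forwarded through the harness's --gen_kwargs flag. The line below is just a restatement of the config above, not a command taken from this thread:

--gen_kwargs do_sample=True,temperature=0.6,top_p=0.9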

Here are my results on GSM8K:

| Model | GSM8K 8-shot (strict-match) | Official GSM8K 8-shot CoT |
| --- | --- | --- |
| 8b-instruct | 76.42 | 79.6 |
| 70b-instruct | 90.35 | 93.0 |

and BBH:

| Model | BBH 3-shot CoT (exact-match) | Official BBH 3-shot CoT (base model) |
| --- | --- | --- |
| 8b-instruct | 63.17 | 61.1 |
| 70b-instruct | 49.35 | 81.3 |

Please note that the performance of 70b-instruct on BBH is quite low. Fortunately, I found that the cause is how the vLLM backend version handles Llama 3's two EOS tokens, together with the Transformers patch for the corresponding EOS stopping criteria. I therefore changed the default vllm version in lm_eval to 0.4.2 and upgraded transformers to 4.40.2 (a reproduction sketch follows the tables below). Then I got reasonable results:

GSM8K:

| Model | GSM8K 8-shot (strict-match) | Official GSM8K 8-shot CoT |
| --- | --- | --- |
| 8b-instruct | 75.44 | 79.6 |
| 70b-instruct | 91.05 | 93.0 |

BBH:

| Model | BBH 3-shot CoT (exact-match) | Official BBH 3-shot CoT (base model) |
| --- | --- | --- |
| 8b-instruct | 64.6 | 61.1 |
| 70b-instruct | 83.38 | 81.3 |
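
For anyone reproducing the environment fix above, here is a minimal sketch of the version pinning and a vLLM-backend run; the task name, model name, and tensor_parallel_size are illustrative assumptions, not values copied from this thread:

pip install "vllm==0.4.2" "transformers==4.40.2"

lm_eval --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3-70B-Instruct,tensor_parallel_size=8 \
    --tasks bbh_cot_fewshot \
    --batch_size auto
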
commented

Hello! I tried llama3-8b-instruct with the max_gen_toks=512 setting on gsm8k_cot, with the following results:
[screenshot: gsm8k_cot scores with max_gen_toks=512]
The version of transformers is 4.38.0. Thanks for your reply!