EleutherAI / lm-evaluation-harness

A framework for few-shot evaluation of language models.

Home Page: https://www.eleuther.ai

llama3 baseline reproduction problem

fmm170 opened this issue

commented

Hello, the GSM8K results for Llama 3 that I reproduced with this framework are quite different from the results reported in the paper (79.6). Is this because the few-shot content for Llama 3 has changed?

Hi there!

Could you share more about what you ran and what scores you got?

On the Instruct Llama3-8B model, gsm8k_cot should use the same prompt they report in https://github.com/meta-llama/llama3/blob/main/eval_details.md#gsm8k . One difference is that you may need to pass --gen_kwargs max_gen_toks=512, since this is mentioned in the linked eval_details.md and I believe we default to a maximum of 256 generated tokens.
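
For reference, a minimal invocation along those lines might look like the sketch below (the model name and batch size here are illustrative choices, not something specified in this thread):

lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks gsm8k_cot \
    --gen_kwargs max_gen_toks=512 \
    --batch_size auto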

commented

[screenshot: gsm8k and gsm8k_cot score tables]
Hello! These are the results I got from running Llama 3 on the gsm8k and gsm8k_cot tasks. Thanks for your reply!

Could you try the suggested changes (higher maximum tokens to generate)? And this is with the instruct model, right?

I also tried evaluating llama3-8b-instruct and llama3-70b-instruct. I used the default generation config from generation_config.json in the Llama 3 files:

{
  "bos_token_id": 128000,
  "eos_token_id": [128001, 128009],
  "do_sample": true,
  "temperature": 0.6,
  "max_length": 4096,
  "top_p": 0.9,
  "transformers_version": "4.40.0.dev0"
}
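
If it helps anyone reproducing this: assuming the hf backend, those sampling settings can be forwarded through the harness's --gen_kwargs flag. The line below is just a restatement of the config above, not a command taken from this thread:

--gen_kwargs do_sample=True,temperature=0.6,top_p=0.9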

Here are my results on GSM8K:

| Model | GSM8K 8-shot (strict-match) | Official GSM8K 8-shot CoT |
| --- | --- | --- |
| 8b-instruct | 76.42 | 79.6 |
| 70b-instruct | 90.35 | 93.0 |

and BBH:

| Model | BBH 3-shot CoT (exact-match) | Official BBH 3-shot CoT (base model) |
| --- | --- | --- |
| 8b-instruct | 63.17 | 61.1 |
| 70b-instruct | 49.35 | 81.3 |

Please note that the performance of 70b-instruct on BBH is quite low. Fortunately, I found that the cause is how the vLLM backend version handles Llama 3's two EOS tokens, together with the Transformers patch for the corresponding EOS stopping criteria. I therefore changed the default vllm version in lm_eval to 0.4.2 and upgraded transformers to 4.40.2 (a reproduction sketch follows the tables below). Then I got reasonable results:

GSM8K:

| Model | GSM8K 8-shot (strict-match) | Official GSM8K 8-shot CoT |
| --- | --- | --- |
| 8b-instruct | 75.44 | 79.6 |
| 70b-instruct | 91.05 | 93.0 |

BBH:

| Model | BBH 3-shot CoT (exact-match) | Official BBH 3-shot CoT (base model) |
| --- | --- | --- |
| 8b-instruct | 64.6 | 61.1 |
| 70b-instruct | 83.38 | 81.3 |
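
For anyone reproducing the environment fix above, here is a minimal sketch of the version pinning and a vLLM-backend run; the task name, model name, and tensor_parallel_size are illustrative assumptions, not values copied from this thread:

pip install "vllm==0.4.2" "transformers==4.40.2"

lm_eval --model vllm \
    --model_args pretrained=meta-llama/Meta-Llama-3-70B-Instruct,tensor_parallel_size=8 \
    --tasks bbh_cot_fewshot \
    --batch_size auto
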
commented

Hello! I tried llama3-8b-instruct with the max_gen_toks=512 setting on gsm8k_cot, with the following results:
[screenshot: gsm8k_cot scores with max_gen_toks=512]
The version of transformers is 4.38.0. Thanks for your reply!