nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath


Cannot reproduce WizardCoder-Python series

yingfhu opened this issue · comments

commented

We use the Hugging Face WizardLM/WizardCoder-Python-34B-V1.0 checkpoint but cannot reproduce the reported HumanEval metrics (55 vs 73) using the prompt from your evaluation script (the low accuracy is caused by AssertionErrors, not invalid grammar).

However, the WizardCoder-xB-V1.0 series achieves reasonably convincing results.

Is there anything different between these two series when evaluating? Thanks.
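For context on the AssertionError point: a HumanEval-style harness typically runs the prompt plus the model's completion together with the problem's test code, and any raised exception (including an AssertionError from a failed test) counts the sample as failed. A minimal sketch of that scoring logic, with illustrative names rather than the repo's actual harness:

```python
def passes(prompt: str, completion: str, test_code: str) -> bool:
    """Execute prompt + completion + tests; any exception means the sample fails."""
    program = prompt + completion + "\n" + test_code
    try:
        exec(program, {})  # AssertionError from a failed test lands in except
        return True
    except Exception:
        return False

prompt = "def add(a, b):\n"
good = "    return a + b\n"
bad = "    return a - b\n"
tests = "assert add(2, 3) == 5\n"

assert passes(prompt, good, tests)
assert not passes(prompt, bad, tests)
```

So a run dominated by AssertionErrors means the generated code parses and runs but produces wrong answers, which is consistent with a prompting or decoding mismatch rather than broken output format.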

You can follow this tutorial to reproduce the performance step by step. If you follow it, you will get exactly the same score.

One important point: do not use transformers > 4.32.0.
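A simple way to enforce that constraint before running an evaluation is a version guard. This is a sketch under the assumption stated above (the maintainer's recommendation of transformers <= 4.32.0); the helper name is illustrative:

```python
def at_most(installed: str, limit: str = "4.32.0") -> bool:
    """Return True if `installed` is <= `limit`, comparing numeric components."""
    parse = lambda s: tuple(int(p) for p in s.split("."))
    return parse(installed) <= parse(limit)

# Guard an evaluation run against an incompatible transformers version.
assert at_most("4.32.0")       # ok
assert at_most("4.31.0")       # ok
assert not at_most("4.33.1")   # too new, would change results
```

Alternatively, pinning with `pip install "transformers<=4.32.0"` in the environment setup achieves the same effect.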

commented

We use the Hugging Face WizardLM/WizardCoder-Python-34B-V1.0 checkpoint but cannot reproduce the reported HumanEval metrics (55 vs 73) using the prompt from your evaluation script (the low accuracy is caused by AssertionErrors, not invalid grammar).

However, the WizardCoder-xB-V1.0 series achieves reasonably convincing results.

Is there anything different between these two series when evaluating? Thanks.

@yingfhu Hey, I faced the same problem. Mind sharing how you managed to solve it?

You can follow this tutorial to reproduce the performance step by step. If you follow it, you will get exactly the same score.

@NinedayWang Check this.

Sorry, what's the deal with transformers > 4.32.0? I believe the discrepancy in performance between the WizardCoder series based on StarCoder and the one based on LLaMA comes from how each base model treats padding. Could that be it?
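The padding hypothesis can be illustrated with a toy sketch (pure Python, not the actual model code). For decoder-only models, generation continues from the last position of each sequence, so right padding inserts pad tokens between the prompt and the continuation, while left padding keeps the real last token at the end. Whether the two base models actually differ this way is the commenter's conjecture, not something confirmed in this thread:

```python
PAD = 0  # illustrative pad token id

def pad_batch(seqs, side):
    """Pad a batch of token-id lists to equal length on the given side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        fill = [PAD] * (width - len(s))
        out.append(fill + s if side == "left" else s + fill)
    return out

batch = [[5, 6, 7], [8]]
left = pad_batch(batch, "left")    # [[5, 6, 7], [0, 0, 8]]
right = pad_batch(batch, "right")  # [[5, 6, 7], [8, 0, 0]]

# With left padding, every sequence ends on its true last token.
assert [row[-1] for row in left] == [7, 8]
# With right padding, the short sequence ends on PAD, so a decoder-only
# model would generate its continuation after the pad tokens.
assert [row[-1] for row in right] == [7, 0]
```

This is why Hugging Face tokenizers for decoder-only generation are usually configured with `padding_side="left"`; a mismatch there could plausibly degrade batched-evaluation scores.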