nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath


Cannot reproduce WizardCoder-Python series

yingfhu opened this issue · comments

commented

We use the Hugging Face WizardLM/WizardCoder-Python-34B-V1.0 checkpoint but cannot reproduce the reported HumanEval metrics (55 vs 73) using the prompt from your evaluation script (the low accuracy is caused by AssertionErrors, not invalid grammar).

However, the WizardCoder-xB-V1.0 series achieves reasonably convincing results.

Is there anything different between these two series when evaluating? Thanks.
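For context on the AssertionError point: a HumanEval-style harness typically runs the prompt plus the model's completion together with the problem's test code, and any raised exception (including an AssertionError from a failed test) counts the sample as failed. A minimal sketch of that scoring logic, with illustrative names rather than the repo's actual harness:

```python
def passes(prompt: str, completion: str, test_code: str) -> bool:
    """Execute prompt + completion + tests; any exception means the sample fails."""
    program = prompt + completion + "\n" + test_code
    try:
        exec(program, {})  # AssertionError from a failed test lands in except
        return True
    except Exception:
        return False

prompt = "def add(a, b):\n"
good = "    return a + b\n"
bad = "    return a - b\n"
tests = "assert add(2, 3) == 5\n"

assert passes(prompt, good, tests)
assert not passes(prompt, bad, tests)
```

So a run dominated by AssertionErrors means the generated code parses and runs but produces wrong answers, which is consistent with a prompting or decoding mismatch rather than broken output format.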

You can follow this tutorial to reproduce the performance step by step. If you follow it, you will get exactly the same score.

One important point: do not use transformers > 4.32.0.
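A simple way to enforce that constraint before running an evaluation is a version guard. This is a sketch under the assumption stated above (the maintainer's recommendation of transformers <= 4.32.0); the helper name is illustrative:

```python
def at_most(installed: str, limit: str = "4.32.0") -> bool:
    """Return True if `installed` is <= `limit`, comparing numeric components."""
    parse = lambda s: tuple(int(p) for p in s.split("."))
    return parse(installed) <= parse(limit)

# Guard an evaluation run against an incompatible transformers version.
assert at_most("4.32.0")       # ok
assert at_most("4.31.0")       # ok
assert not at_most("4.33.1")   # too new, would change results
```

Alternatively, pinning with `pip install "transformers<=4.32.0"` in the environment setup achieves the same effect.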

commented

We use the Hugging Face WizardLM/WizardCoder-Python-34B-V1.0 checkpoint but cannot reproduce the reported HumanEval metrics (55 vs 73) using the prompt from your evaluation script (the low accuracy is caused by AssertionErrors, not invalid grammar).

However, the WizardCoder-xB-V1.0 series achieves reasonably convincing results.

Is there anything different between these two series when evaluating? Thanks.

@yingfhu Hey, I faced the same problem. Mind sharing how you managed to solve it?

You can follow this tutorial to reproduce the performance step by step. If you follow it, you will get exactly the same score.

@NinedayWang Check this.

Sorry, what's the deal with transformers > 4.32.0? I believe the discrepancy in performance between the WizardCoder series based on StarCoder and the one based on LLaMA comes from how each base model treats padding. Could that be it?
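The padding hypothesis can be illustrated with a toy sketch (pure Python, not the actual model code). For decoder-only models, generation continues from the last position of each sequence, so right padding inserts pad tokens between the prompt and the continuation, while left padding keeps the real last token at the end. Whether the two base models actually differ this way is the commenter's conjecture, not something confirmed in this thread:

```python
PAD = 0  # illustrative pad token id

def pad_batch(seqs, side):
    """Pad a batch of token-id lists to equal length on the given side."""
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        fill = [PAD] * (width - len(s))
        out.append(fill + s if side == "left" else s + fill)
    return out

batch = [[5, 6, 7], [8]]
left = pad_batch(batch, "left")    # [[5, 6, 7], [0, 0, 8]]
right = pad_batch(batch, "right")  # [[5, 6, 7], [8, 0, 0]]

# With left padding, every sequence ends on its true last token.
assert [row[-1] for row in left] == [7, 8]
# With right padding, the short sequence ends on PAD, so a decoder-only
# model would generate its continuation after the pad tokens.
assert [row[-1] for row in right] == [7, 0]
```

This is why Hugging Face tokenizers for decoder-only generation are usually configured with `padding_side="left"`; a mismatch there could plausibly degrade batched-evaluation scores.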