noahshinn / reflexion

[NeurIPS 2023] Reflexion: Language Agents with Verbal Reinforcement Learning

Can't reproduce HumanEval score

geekan opened this issue · comments

I followed programming_runs/run_reflexion.sh and got scores of 0.77-0.83 across multiple trials.
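For reference, here is a minimal sketch of how such a pass rate maps to solved problems, assuming the standard 164-problem HumanEval set (the solved counts below are illustrative, not numbers from the paper):

```python
# Illustrative arithmetic only: a HumanEval pass rate is solved / total.
total = 164  # standard HumanEval size
for solved in (126, 131, 136):  # hypothetical solved counts
    print(f"{solved}/{total} = {solved / total:.3f}")
# 126/164 ≈ 0.768, 131/164 ≈ 0.799, 136/164 ≈ 0.829
```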

I cannot reproduce the result either 😭 Would it be possible for the authors to release the GPT-4-generated tests used in the experiments?

Hi @geekan and @FloridSleeves,

As with many LLM papers, we are subject to the performance of proprietary models, since there is not yet an open-source option that evaluates at a comparably high level of performance. We show results for some open-source models in the appendix of the recent version of the paper to demonstrate this. If you want to use OpenAI's models with Reflexion, I would advise using the -0314 suffix on the gpt-4 or gpt-3.5-turbo model names (i.e., gpt-4-0314 or gpt-3.5-turbo-0314) to evaluate a checkpoint closer in time to our experiments. I hope we will have more open-source options on which to run Reflexion in the future.

I just ran programming_runs/run_reflexion.sh directly and also got only about 80%.

Also, the HumanEval set used here contains only 161 problems, but I believe it should have 164?
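For what it's worth, one quick way to check the expected problem count against the canonical benchmark, assuming the Hugging Face `datasets` package is available (it is not part of the Reflexion repo):

```python
from datasets import load_dataset

# The canonical OpenAI HumanEval benchmark has 164 problems in its "test" split.
humaneval = load_dataset("openai_humaneval", split="test")
print(len(humaneval))  # expected: 164
```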