Reproduce HumanEval Results
boblee22 opened this issue · comments
Hi, I was trying to reproduce the HumanEval results of CodeGen-16B-Mono. My pass@1 results were significantly worse than those reported in the paper.
Here are my current results.
temperature 0.2: {'pass@1': 0.15926829268292686, 'pass@10': 0.45424462172394614, 'pass@100': 0.7596262109932597}
temperature 0.6: {'pass@1': 0.1631707317073171, 'pass@10': 0.45846712024390285, 'pass@100': 0.7349539270978501}
temperature 0.8: {'pass@1': 0.16115853658536589, 'pass@10': 0.4522397258878905, 'pass@100': 0.7548574227789276}
I used the checkpoint from https://huggingface.co/Salesforce/codegen-16B-mono and generated 200 completions for each HumanEval problem. The evaluation was run on https://github.com/openai/human-eval.
Here is a code snippet of how I prompted the model.
inputs = tokenizer(problem["prompt"], return_tensors="pt")
canonical_solution = tokenizer(problem["canonical_solution"]).input_ids
input_ids_len = inputs.input_ids.shape[1]
output = model.generate(
    **inputs,
    do_sample=True,
    temp=problem["temp"],
    top_p=0.95,
    max_length=input_ids_len + max(128, len(canonical_solution) + 64),
    pad_token_id=tokenizer.eos_token_id,
)
Could you please give me some guidance to reproduce the paper results?
Thank you!
Thanks for your comment. The results look oddly uncorrelated with the temperatures.
From the snippet, here are the differences:
- The temperature is specified with `temperature`, not `temp`. This may cause the specified temperature values to be ignored.
- We did not use canonical solutions to determine the maximum length. For HumanEval experiments, we used `input_ids_len + 512`.
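The two differences listed above can be sketched as a small helper that builds the sampling arguments. This is only an illustration under stated assumptions: the helper name `build_generation_kwargs` and its `max_new` parameter are hypothetical, and the remaining settings (`do_sample`, `top_p=0.95`, `pad_token_id`) are copied from the snippet in the question rather than taken from the paper's scripts.

```python
# Hypothetical helper, not part of transformers: it only collects the
# keyword arguments that would be passed to `model.generate`.
def build_generation_kwargs(prompt_len, temperature, eos_token_id, max_new=512):
    """Return sampling arguments for one HumanEval prompt."""
    return {
        "do_sample": True,
        # The keyword is `temperature`; `temp` is not a sampling argument.
        "temperature": temperature,
        "top_p": 0.95,
        # Fixed budget of prompt length + 512, independent of the canonical solution.
        "max_length": prompt_len + max_new,
        "pad_token_id": eos_token_id,
    }


# Usage with an already loaded model and tokenizer (not executed here):
# inputs = tokenizer(problem["prompt"], return_tensors="pt")
# kwargs = build_generation_kwargs(
#     inputs.input_ids.shape[1], 0.2, tokenizer.eos_token_id
# )
# output = model.generate(**inputs, **kwargs)
```

Keeping the arguments in one place makes it easy to print them and confirm that the intended temperature is actually the one being passed.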
Additionally, please verify that you are using the corresponding tokenizer for the model.
If you could share your sampling script somewhere (e.g. gist), I'd be happy to take a look.
Just reproduced the paper results! Thank you very much for your help!
temperature 0.2: {'pass@1': 0.2910060975609756, 'pass@10': 0.4311547000349374, 'pass@100': 0.5379728166388384}
temperature 0.6: {'pass@1': 0.26408536585365855, 'pass@10': 0.5354897815147874, 'pass@100': 0.7727023953428368}
temperature 0.8: {'pass@1': 0.23518292682926834, 'pass@10': 0.528405455080335, 'pass@100': 0.7764527488844969}
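For anyone comparing their own numbers against these, here is a minimal sketch of how pass@k is typically estimated from n samples per problem (n = 200 in this thread), of which c pass the tests. It uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are illustrative and are not the evaluation harness's own API.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of sampled completions, c: number that pass, k: budget (k <= n).
    """
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimate over all problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Example with made-up counts for three problems, 200 samples each.
print(mean_pass_at_k([0, 50, 200], n=200, k=1))
```

With k = 1 the estimate reduces to the fraction of passing samples, so pass@1 is directly sensitive to the sampling temperature.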