salesforce / CodeGen

CodeGen is a family of open-source models for program synthesis, trained on TPU-v4 and competitive with OpenAI Codex.


Reproduce HumanEval Results

boblee22 opened this issue

Hi, I was trying to reproduce the HumanEval results of CodeGen-16B-Mono. My pass@1 results were significantly worse than those reported in the paper.

Here are my current results.
temperature 0.2: {'pass@1': 0.15926829268292686, 'pass@10': 0.45424462172394614, 'pass@100': 0.7596262109932597}
temperature 0.6: {'pass@1': 0.1631707317073171, 'pass@10': 0.45846712024390285, 'pass@100': 0.7349539270978501}
temperature 0.8: {'pass@1': 0.16115853658536589, 'pass@10': 0.4522397258878905, 'pass@100': 0.7548574227789276}

I used the checkpoint from https://huggingface.co/Salesforce/codegen-16B-mono and generated 200 completions for each HumanEval problem. The evaluation was run on https://github.com/openai/human-eval.
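For reference, human-eval computes these numbers with the unbiased pass@k estimator from the Codex paper; a minimal sketch of that estimator (numerically stable product form):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of
    # samples per problem and c the number that pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples with 32 passing gives pass_at_k(200, 32, 1) = 0.16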

Here is a code snippet of how I prompted the model.

inputs = tokenizer(problem["prompt"], return_tensors="pt")
canonical_solution = tokenizer(problem["canonical_solution"]).input_ids
input_ids_len = inputs.input_ids.shape[1]
output = model.generate(
    **inputs,
    do_sample=True,
    temp=problem["temp"],
    top_p=0.95,
    max_length=input_ids_len + max(128, len(canonical_solution) + 64),
    pad_token_id=tokenizer.eos_token_id,
)
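The completions are then decoded (dropping the prompt tokens) and written out in the JSONL format that human-eval expects; a minimal sketch of that step, reusing the variables above:

from human_eval.data import write_jsonl

# Keep only the newly generated tokens, not the prompt itself.
completion = tokenizer.decode(output[0][input_ids_len:], skip_special_tokens=True)

samples = [dict(task_id=problem["task_id"], completion=completion)]
write_jsonl("samples.jsonl", samples)
# Scored afterwards with: evaluate_functional_correctness samples.jsonl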

Could you please give me some guidance on how to reproduce the paper's results?

Thank you!

Thanks for your comment. The results look oddly uncorrelated with the sampling temperature.

From the snippet, here are the differences (see the corrected call after this list):

  • The sampling temperature is specified with the temperature keyword, not temp. An unrecognized temp argument may be silently ignored, in which case generation falls back to the default temperature of 1.0 for all runs, which would explain why your results barely vary across temperature settings.
  • We did not use the canonical solutions to determine the maximum length. For the HumanEval experiments, we used input_ids_len + 512.
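Applied to your snippet, the corrected call would look like this:

output = model.generate(
    **inputs,
    do_sample=True,
    temperature=problem["temp"],  # keyword is `temperature`, not `temp`
    top_p=0.95,
    max_length=input_ids_len + 512,  # fixed budget, independent of the canonical solution
    pad_token_id=tokenizer.eos_token_id,
)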

Additionally, please verify that you are using the corresponding tokenizer for the model.
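For example, loading both from the same checkpoint (assuming a recent transformers version) guarantees they match:

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "Salesforce/codegen-16B-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)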

If you could share your sampling script somewhere (e.g. gist), I'd be happy to take a look.

Just reproduced the paper results! Thank you very much for your help!

temperature 0.2: {'pass@1': 0.2910060975609756, 'pass@10': 0.4311547000349374, 'pass@100': 0.5379728166388384}
temperature 0.6: {'pass@1': 0.26408536585365855, 'pass@10': 0.5354897815147874, 'pass@100': 0.7727023953428368}
temperature 0.8: {'pass@1': 0.23518292682926834, 'pass@10': 0.528405455080335, 'pass@100': 0.7764527488844969}

If possible, please share your evaluation script @boblee22