salesforce / CodeGen

CodeGen is a family of open-source models for program synthesis, trained on TPU-v4 and competitive with OpenAI Codex.


Reproduce HumanEval Results

boblee22 opened this issue

Hi, I was trying to reproduce the HumanEval results of CodeGen-16B-Mono. My pass@1 results were significantly worse than those reported in the paper.

Here are my current results.
temperature 0.2: {'pass@1': 0.15926829268292686, 'pass@10': 0.45424462172394614, 'pass@100': 0.7596262109932597}
temperature 0.6: {'pass@1': 0.1631707317073171, 'pass@10': 0.45846712024390285, 'pass@100': 0.7349539270978501}
temperature 0.8: {'pass@1': 0.16115853658536589, 'pass@10': 0.4522397258878905, 'pass@100': 0.7548574227789276}

I used the checkpoint from https://huggingface.co/Salesforce/codegen-16B-mono and generated 200 completions for each HumanEval problem. The evaluation was run on https://github.com/openai/human-eval.
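For reference, human-eval computes these numbers with the unbiased pass@k estimator from the Codex paper; a minimal sketch of that estimator (numerically stable product form):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of
    # samples per problem and c the number that pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples with 32 passing gives pass_at_k(200, 32, 1) = 0.16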

Here is a code snippet of how I prompted the model.

inputs = tokenizer(problem["prompt"], return_tensors="pt")
canonical_solution = tokenizer(problem["canonical_solution"]).input_ids
input_ids_len = inputs.input_ids.shape[1]
output = model.generate(
    **inputs,
    do_sample=True,
    temp=problem["temp"],
    top_p=0.95,
    max_length=input_ids_len + max(128, len(canonical_solution) + 64),
    pad_token_id=tokenizer.eos_token_id,
)
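The completions are then decoded (dropping the prompt tokens) and written out in the JSONL format that human-eval expects; a minimal sketch of that step, reusing the variables above:

from human_eval.data import write_jsonl

# Keep only the newly generated tokens, not the prompt itself.
completion = tokenizer.decode(output[0][input_ids_len:], skip_special_tokens=True)

samples = [dict(task_id=problem["task_id"], completion=completion)]
write_jsonl("samples.jsonl", samples)
# Scored afterwards with: evaluate_functional_correctness samples.jsonl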

Could you please give me some guidance on how to reproduce the paper's results?

Thank you!

Thanks for your comment. The results look oddly uncorrelated with the sampling temperature.

From the snippet, here are the differences (see the corrected call after this list):

  • The sampling temperature is specified with the temperature keyword, not temp. An unrecognized temp argument may be silently ignored, in which case generation falls back to the default temperature of 1.0 for all runs, which would explain why your results barely vary across temperature settings.
  • We did not use the canonical solutions to determine the maximum length. For the HumanEval experiments, we used input_ids_len + 512.
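Applied to your snippet, the corrected call would look like this:

output = model.generate(
    **inputs,
    do_sample=True,
    temperature=problem["temp"],  # keyword is `temperature`, not `temp`
    top_p=0.95,
    max_length=input_ids_len + 512,  # fixed budget, independent of the canonical solution
    pad_token_id=tokenizer.eos_token_id,
)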

Additionally, please verify that you are using the corresponding tokenizer for the model.
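For example, loading both from the same checkpoint (assuming a recent transformers version) guarantees they match:

from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "Salesforce/codegen-16B-mono"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)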

If you could share your sampling script somewhere (e.g. gist), I'd be happy to take a look.

Just reproduced the paper results! Thank you very much for your help!

temperature 0.2: {'pass@1': 0.2910060975609756, 'pass@10': 0.4311547000349374, 'pass@100': 0.5379728166388384}
temperature 0.6: {'pass@1': 0.26408536585365855, 'pass@10': 0.5354897815147874, 'pass@100': 0.7727023953428368}
temperature 0.8: {'pass@1': 0.23518292682926834, 'pass@10': 0.528405455080335, 'pass@100': 0.7764527488844969}

If possible, please share your evaluation script @boblee22