salesforce / CodeGen

CodeGen is an open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.

Confusion about inconsistent `eos_token_id`

xingyaoww opened this issue · comments

Hi,

Thanks for releasing this model! It seems that the eos_token_id is inconsistent between the model config and the pretrained tokenizer. The model config gives 2 as eos_token_id, yet the eos_token_id for the tokenizer is 50256. I'm wondering which one is correct.

The provided sample.py example also uses 50256 as pad_token_id. pad_token_id should be the same as eos_token_id for GPT, right? So should the correct eos_token_id be 50256?

If we don't specify eos_token_id as an input argument to the .generate function in sample.py (but do specify pad_token_id as 50256), it will automatically take eos_token_id from self.config.eos_token_id, which is 2 -- is this an issue?

Code to reproduce:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")

print(model.config.eos_token_id) # 2
print(tokenizer.eos_token_id) # 50256
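
Until this is clarified, one workaround might be to pass both token ids to .generate explicitly so the config value is never consulted. This is only a minimal sketch; the prompt string and generation arguments below are illustrative, not from sample.py:

inputs = tokenizer("def hello_world():", return_tensors="pt")
# Passing eos_token_id and pad_token_id explicitly overrides model.config
sample = model.generate(
    **inputs,
    max_length=64,
    eos_token_id=tokenizer.eos_token_id,  # 50256
    pad_token_id=tokenizer.eos_token_id,  # same as eos for GPT-style models
)
print(tokenizer.decode(sample[0]))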

Thanks for pointing this out.

The correct eos token is tokenizer.eos_token_id == 50256.

We will fix the model configuration files.
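
Until the updated configuration files are published, overriding the loaded config in place should also work. A minimal sketch, using the objects from the snippet above:

# Align the in-memory config with the tokenizer until the hub files are fixed
model.config.eos_token_id = tokenizer.eos_token_id  # 50256
model.config.pad_token_id = tokenizer.eos_token_id  # GPT-style models reuse eos as pad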

@xingyaoww Model configs are updated:

In [1]: from transformers import AutoTokenizer, AutoModelForCausalLM
   ...: tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
   ...: model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")
   ...:
   ...: print(model.config.eos_token_id)
   ...: print(tokenizer.eos_token_id)
Downloading config.json: 100%
50256
50256

Thanks again for pointing this out!

@enijkamp, if the eos token id is changed, accuracy on the eval dataset will also be impacted. If that is true, what about the accuracy on the HumanEval dataset reported in the paper?

For the HumanEval benchmark execution, the tokenizer is instantiated explicitly, so the model configuration file has no effect. See:

def create_custom_gpt2_tokenizer():
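
For illustration only, explicit instantiation along these lines decouples the benchmark from the hub config. This is a minimal sketch of the idea; the actual create_custom_gpt2_tokenizer in the repo may add further customization (e.g. extra whitespace handling) not shown here:

from transformers import GPT2TokenizerFast

def create_custom_gpt2_tokenizer():
    # Build the tokenizer directly from the GPT-2 vocabulary, so eos_token_id
    # is always 50256 regardless of the eos_token_id in the model config.
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    return tokenizer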

Thank you for the consideration!