OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2


ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000

silvacarl2 opened this issue:

When running this conversion:

```
ct2-transformers-converter --model WizardLM/WizardLM-13B-V1.2 --quantization float16 --force --output_dir WizardLM-13B-V1-2-float16
```

I got this error message:

```
ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000
```

Any ideas?

Any updates? I am facing the same problem!

The model expects a vocabulary of 32000 tokens, but the tokenizer returns those 32000 tokens plus 1 extra token defined in added_tokens.json. This is the cause of the size mismatch. I don't know why the vocab size in the config stays at 32000 even after a token is added. A quick fix would be to ignore the added token, but that may hurt the quality of the model.
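For anyone who wants to confirm this, here is a minimal sketch of the size check, assuming the Hugging Face transformers library (the printed values are the ones described above):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "WizardLM/WizardLM-13B-V1.2"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The config reports the embedding size, while the tokenizer
# also counts the entries from added_tokens.json.
print("config.vocab_size:", config.vocab_size)       # 32000
print("len(tokenizer):", len(tokenizer))             # 32001
print("added tokens:", tokenizer.get_added_vocab())  # the extra token
```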

The fix is here: #1621

@minhthuc2502 It works, thanks!
This change needs to be done for the Mistral config as well. You could also add a warning message and print the token that is going to be truncated. In my case, it is a `<sep>` token.
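For illustration, a hypothetical sketch of that truncate-and-warn idea; the names here are illustrative, not the actual converter code from #1621:

```python
import logging

def truncate_vocabulary(tokens, expected_size):
    """Drop trailing added tokens so the vocabulary matches the model size."""
    if len(tokens) > expected_size:
        for token in tokens[expected_size:]:
            logging.warning("Ignoring added token not covered by the model: %s", token)
        tokens = tokens[:expected_size]
    return tokens

# Example: a 32001-entry vocabulary whose last entry is an added <sep> token.
vocab = [f"token_{i}" for i in range(32000)] + ["<sep>"]
vocab = truncate_vocabulary(vocab, expected_size=32000)
print(len(vocab))  # 32000
```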

Thanks for reminding me. These models are fine-tuned from Llama, so I think the fix on the Llama side is enough. If there is any model based on Mistral, feel free to open a new issue.