ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000
silvacarl2 opened this issue · comments
when running this conversion:
ct2-transformers-converter --model WizardLM/WizardLM-13B-V1.2 --quantization float16 --force --output_dir WizardLM-13B-V1-2-float16
got this error message:
ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000
any ideas?
Any updates, I am facing the same problem!
The model expects a vocabulary of 32000 tokens, but the tokenizer returns 32000 tokens plus the 1 extra token defined in added_tokens.json. This is the cause of the size mismatch. I don't know why the vocab size in the config stays at 32000 even though a token was added. A quick fix is to ignore the added token, but that may affect the model's quality.
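As a minimal sketch of what goes wrong (the values below are assumed from this report, not read from the actual checkpoint): config.json still declares the base vocabulary size, while the tokenizer's effective size also counts the entries in added_tokens.json, so the converter's size check fails.

```python
# Assumed values mirroring this issue (not the real CTranslate2 converter code):
config_vocab_size = 32000        # "vocab_size" in config.json
base_vocab_size = 32000          # tokenizer's base vocabulary
added_tokens = {"<sep>": 32000}  # contents of added_tokens.json

# Effective tokenizer size includes the added tokens.
tokenizer_size = base_vocab_size + len(added_tokens)
print(tokenizer_size)  # 32001, which mismatches config_vocab_size

# Tokens whose ids fall outside the model's embedding table; the "quick fix"
# mentioned above drops these from the exported vocabulary, at the risk of
# losing whatever behaviour the fine-tune attached to them.
out_of_range = [tok for tok, idx in added_tokens.items() if idx >= config_vocab_size]
print(out_of_range)  # ['<sep>']
```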
@minhthuc2502 It works, thanks!
This change needs to be made for the Mistral config as well. You could also add a warning message and print the token that is going to be truncated; in my case it is a <sep> token.
Thanks for reminding me. These models are fine-tuned from Llama, so I think the fix on Llama is enough. If any model based on Mistral hits this, feel free to open a new issue.