OpenNMT / CTranslate2

Fast inference engine for Transformer models

Home Page: https://opennmt.net/CTranslate2


ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000

silvacarl2 opened this issue:

When running this conversion:

```
ct2-transformers-converter --model WizardLM/WizardLM-13B-V1.2 --quantization float16 --force --output_dir WizardLM-13B-V1-2-float16
```

I got this error message:

```
ValueError: Vocabulary has size 32001 but the model expected a vocabulary of size 32000
```

Any ideas?

Any updates? I am facing the same problem!

The model expects a vocabulary of 32000 tokens, but the tokenizer returns those 32000 tokens plus 1 extra token defined in added_tokens.json. This is the cause of the size mismatch. I don't know why the vocab size in the config stays at 32000 even after a token is added. A quick fix would be to ignore the added token, but that may hurt the quality of the model.
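For anyone who wants to confirm this, here is a minimal sketch of the size check, assuming the Hugging Face transformers library (the printed values are the ones described above):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "WizardLM/WizardLM-13B-V1.2"

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The config reports the embedding size, while the tokenizer
# also counts the entries from added_tokens.json.
print("config.vocab_size:", config.vocab_size)       # 32000
print("len(tokenizer):", len(tokenizer))             # 32001
print("added tokens:", tokenizer.get_added_vocab())  # the extra token
```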

The fix is here: #1621

@minhthuc2502 It works, thanks!
This change needs to be done for the Mistral config as well. You could also add a warning message and print the token that is going to be truncated. In my case, it is a `<sep>` token.
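For illustration, a hypothetical sketch of that truncate-and-warn idea; the names here are illustrative, not the actual converter code from #1621:

```python
import logging

def truncate_vocabulary(tokens, expected_size):
    """Drop trailing added tokens so the vocabulary matches the model size."""
    if len(tokens) > expected_size:
        for token in tokens[expected_size:]:
            logging.warning("Ignoring added token not covered by the model: %s", token)
        tokens = tokens[:expected_size]
    return tokens

# Example: a 32001-entry vocabulary whose last entry is an added <sep> token.
vocab = [f"token_{i}" for i in range(32000)] + ["<sep>"]
vocab = truncate_vocabulary(vocab, expected_size=32000)
print(len(vocab))  # 32000
```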

Thanks for reminding me. These models are fine-tuned from Llama, so I think the fix on the Llama side is enough. If there is any model based on Mistral, feel free to open a new issue.