google / gemma_pytorch

The official PyTorch implementation of Google's Gemma models

Home Page: https://ai.google.dev/gemma

Are there reserved/unused tokens for developers?

Qubitium opened this issue

Because a BPE vocabulary cannot be expanded dynamically after training, some BPE-tokenizer-based models such as Qwen reserve 2k extra unused tokens at the end of the vocabulary for developers to use as they see fit during finetuning.

Does Gemma have a list of internally unused tokens?

Sometimes model makers resize the vocab to a GPU-friendly multiple, which creates unused tokens, or intentionally leave some unused tokens, as Qwen does.

Yes, there are! If you iterate through the vocab, you should find some <unusedXX> tokens. They weren't used during training, but can be used for any other purpose. I think there are around 90 or so of these tokens; let us know if this helps.
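
For reference, here is a minimal sketch of how one might enumerate those tokens with the SentencePiece tokenizer that ships with Gemma. The "tokenizer.model" path is an assumption; point it at the tokenizer file from your own checkpoint.

```python
# Sketch: list the <unusedXX> tokens in Gemma's SentencePiece vocab.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # assumed path to the Gemma tokenizer model file

# Scan the full vocabulary and collect pieces that look like <unusedNN>.
unused = [
    (i, sp.IdToPiece(i))
    for i in range(sp.GetPieceSize())
    if sp.IdToPiece(i).startswith("<unused")
]

print(f"Found {len(unused)} unused tokens")
for token_id, piece in unused[:5]:
    print(token_id, piece)
```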

Thank you! Exactly what we are looking for.

Are there any pointers or guidelines on how we can make use of these <unusedXX> tokens?
How can one make use of them while finetuning?