google / gemma_pytorch

The official PyTorch implementation of Google's Gemma models

Home Page: https://ai.google.dev/gemma

Are there reserved/unused tokens for developers?

Qubitium opened this issue

Because a BPE vocabulary cannot be expanded dynamically after training, some BPE-tokenizer-based models such as Qwen reserve 2k extra unused tokens at the end of the vocabulary for developers to use as they see fit during finetuning.

Does Gemma have a list of internally unused tokens?

Sometimes model makers resize the vocab to a GPU-friendly multiple, which creates unused tokens, or intentionally leave some unused tokens, as Qwen does.

Yes, there are! If you iterate through the vocab, you should find some <unusedXX> tokens. They weren't used during training, but can be used for any other purpose. I think there are around 90 or so of these tokens; let us know if this helps.
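
For reference, here is a minimal sketch of how one might enumerate those tokens with the SentencePiece tokenizer that ships with Gemma. The "tokenizer.model" path is an assumption; point it at the tokenizer file from your own checkpoint.

```python
# Sketch: list the <unusedXX> tokens in Gemma's SentencePiece vocab.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")  # assumed path to the Gemma tokenizer model file

# Scan the full vocabulary and collect pieces that look like <unusedNN>.
unused = [
    (i, sp.IdToPiece(i))
    for i in range(sp.GetPieceSize())
    if sp.IdToPiece(i).startswith("<unused")
]

print(f"Found {len(unused)} unused tokens")
for token_id, piece in unused[:5]:
    print(token_id, piece)
```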

Thank you! Exactly what we are looking for.

Are there any pointers or guidelines on how we can make use of these <unusedXX> tokens?
How can one make use of them while finetuning?