What tokenizer is used?
Maykeye opened this issue · comments
Maykeye commented
I can't see any mention of a tokenizer. The original Mamba used
EleutherAI/gpt-neox-20b
which has 50277 entries, and this value is mirrored in the original Mamba config.
Here, model.config.vocab_size
is 50304, and the embedding reflects that as well:
In [56]: model.embedding
Out[56]: Embedding(50304, 1472)
In [57]: len(tokenizer)
Out[57]: 50277
It seems to work with EleutherAI/gpt-neox-20b, but maybe this model uses some variant of it.
Quentin Anthony commented
Those are padding tokens that round our vocab size up to an efficient number. The underlying GPU GEMM kernels are much more efficient at 50304. Learn more in https://arxiv.org/abs/2401.14489
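A minimal sketch of the padding arithmetic (the multiple of 128 is an assumption based on the numbers in this thread, and pad_vocab_size is a hypothetical helper, not part of the Mamba codebase):

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple, for GEMM efficiency."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# The gpt-neox-20b tokenizer has 50277 entries; padding to a multiple of
# 128 gives the 50304 seen in model.config.vocab_size.
print(pad_vocab_size(50277))  # 50304
```

The extra 27 embedding rows correspond to token IDs the tokenizer never produces, so they are harmless at inference time.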