What tokenizer is used?
Maykeye opened this issue · comments
Maykeye commented
I can't see any mention of a tokenizer. The original Mamba used
EleutherAI/gpt-neox-20b
which has 50277 entries, and this value is mirrored in the original Mamba config.
Here, model.config.vocab_size
is 50304, and the embedding reflects that as well:
In [56]: model.embedding
Out[56]: Embedding(50304, 1472)
In [57]: len(tokenizer)
Out[57]: 50277
It seems to work with EleutherAI/gpt-neox-20b, but maybe this model uses some variant of it.
Quentin Anthony commented
Those are padding tokens that round our vocab size up to an efficient number. The underlying GPU GEMM kernels are much more efficient at 50304. Learn more in https://arxiv.org/abs/2401.14489
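A minimal sketch of the padding arithmetic (the multiple of 128 is an assumption based on the numbers in this thread, and pad_vocab_size is a hypothetical helper, not part of the Mamba codebase):

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple, for GEMM efficiency."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# The gpt-neox-20b tokenizer has 50277 entries; padding to a multiple of
# 128 gives the 50304 seen in model.config.vocab_size.
print(pad_vocab_size(50277))  # 50304
```

The extra 27 embedding rows correspond to token IDs the tokenizer never produces, so they are harmless at inference time.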