Zyphra / BlackMamba

Code repository for BlackMamba

What tokenizer is used?

Maykeye opened this issue:

I can't see any mention of the tokenizer. The original Mamba used
EleutherAI/gpt-neox-20b, which has 50277 entries, and this value is mirrored in the original Mamba config.
Here, model.config.vocab_size is 50304, and the embedding reflects that as well:

In [56]: model.embedding
Out[56]: Embedding(50304, 1472)

In [57]: len(tokenizer)
Out[57]: 50277

The model seems to work with EleutherAI/gpt-neox-20b, but maybe it uses some variant of it.
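
For reference, the mismatch is easy to reproduce (a minimal sketch, assuming the Hugging Face transformers library and the tokenizer name from the original Mamba release):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(len(tokenizer))          # 50277 tokenizer entries
print(50304 - len(tokenizer))  # 27 embedding rows with no corresponding token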

Those are padding tokens, added to round our vocab size up to an efficient number. The underlying GPU GEMM kernels are much more efficient at 50304. Learn more in https://arxiv.org/abs/2401.14489
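
To illustrate the rounding, here is a minimal sketch (not the repository's actual code) of how a vocab size is typically padded up to a GEMM-friendly multiple:

def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple for efficient GPU GEMMs."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab_size(50277))  # 50304, matching model.config.vocab_size

The extra 27 embedding rows are never produced by the tokenizer; they exist only so the output projection and embedding matrices have efficient dimensions.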