karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Larger Tokenizers

dustinwloring1988 opened this issue

I would love to train GPT-2 with a larger BPE tokenizer, maybe even with Llama 3's tokenizer, since it has a vocab size of about 128K. However, this code will not work with a tokenizer that has a large vocab. Is there an easy way to add this?
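
For context, here is a minimal sketch (not part of llm.c) of why the vocab size matters: the token-embedding table has shape (V, C) and the logits buffer has shape (B, T, V), so both grow linearly with the vocab. The width C=768 below assumes the GPT-2 124M configuration, and the batch/sequence sizes are just example values; the ~128K figure approximates Llama 3's vocab.

```c
// Rough estimate of how the token-embedding table and the per-step logits
// buffer grow when moving from GPT-2's 50257-token vocab to a Llama-3-sized
// vocab (~128K). Values and field meanings are illustrative assumptions.
#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t C = 768;          // model width (GPT-2 124M, assumed)
    size_t B = 4, T = 1024;  // example batch size and sequence length

    size_t vocabs[2] = {50257, 128256};  // GPT-2 vs. approx. Llama 3 vocab
    for (int i = 0; i < 2; i++) {
        size_t V = vocabs[i];
        size_t wte_params = V * C;         // token embedding table, shape (V, C)
        size_t logits_floats = (size_t)B * T * V;  // logits buffer, shape (B, T, V)
        printf("vocab %zu: wte %.1f M params, logits %.1f MiB (fp32)\n",
               V,
               wte_params / 1e6,
               logits_floats * sizeof(float) / (1024.0 * 1024.0));
    }
    return 0;
}
```

In other words, the embedding/unembedding parameters and the logits activations roughly 2.5x when going from a 50K to a 128K vocab, which is why the allocation sizes derived from the config's vocab size would need to accommodate the larger tokenizer.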