karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Larger Tokenizers

dustinwloring1988 opened this issue

I would love to train GPT-2 with a larger BPE tokenizer, maybe even with Llama 3's tokenizer, since it has a vocab size of about 128K. However, this code will not work with a tokenizer that has a large vocab. Is there an easy way to add this?
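
For context, here is a minimal sketch (not part of llm.c) of why the vocab size matters: the token-embedding table has shape (V, C) and the logits buffer has shape (B, T, V), so both grow linearly with the vocab. The width C=768 below assumes the GPT-2 124M configuration, and the batch/sequence sizes are just example values; the ~128K figure approximates Llama 3's vocab.

```c
// Rough estimate of how the token-embedding table and the per-step logits
// buffer grow when moving from GPT-2's 50257-token vocab to a Llama-3-sized
// vocab (~128K). Values and field meanings are illustrative assumptions.
#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t C = 768;          // model width (GPT-2 124M, assumed)
    size_t B = 4, T = 1024;  // example batch size and sequence length

    size_t vocabs[2] = {50257, 128256};  // GPT-2 vs. approx. Llama 3 vocab
    for (int i = 0; i < 2; i++) {
        size_t V = vocabs[i];
        size_t wte_params = V * C;         // token embedding table, shape (V, C)
        size_t logits_floats = (size_t)B * T * V;  // logits buffer, shape (B, T, V)
        printf("vocab %zu: wte %.1f M params, logits %.1f MiB (fp32)\n",
               V,
               wte_params / 1e6,
               logits_floats * sizeof(float) / (1024.0 * 1024.0));
    }
    return 0;
}
```

In other words, the embedding/unembedding parameters and the logits activations roughly 2.5x when going from a 50K to a 128K vocab, which is why the allocation sizes derived from the config's vocab size would need to accommodate the larger tokenizer.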