karpathy / llm.c

LLM training in simple, raw C/CUDA

Repository from GitHub: https://github.com/karpathy/llm.c

Deleting Conda/Python as a dependency entirely to dramatically decrease "latency to step"

karpathy opened this issue

commented

Following up on this tweet, copy-pasting it here and just creating an Issue as a TODO.

"""
The thing that makes this a bit complicated right now is the start latency. What bloats up the setup time right now is the dataset and its tokenization, which is all done in Python right now. Installing huggingface datasets, downloading FineWeb 10B and tokenizing it is currently ~1 hr. I think I have to look into precomputing all of this and just saving the final .bin files (20GB) of tokens somewhere (S3 or so?). You could imagine fetching data shards asynchronously while the training started. This would completely eliminate any Python dependency.

The next slightly annoying thing is cuDNN, which is a 2GB download and installation, just to get the flash attention kernel. And it compiles for 1.5 minutes. But NVIDIA reached out and mentioned they are trying to bring this down a lot.

In principle, the code should compile and run roughly instantaneously.
"""

TLDR I think I'll pre-tokenize FineWeb100B with the GPT-2 tokenizer, zip up the .bin shards, and put them up somewhere (e.g. S3?). Then we could just download, unzip, and train directly without any Python involvement at all.

TODO think through a bit.
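
For concreteness, the pre-tokenization step could look roughly like the sketch below. This is only an assumption of how it might be done, not the repo's actual data script; the dataset config name, shard size, and flat uint16 layout are placeholder choices.

import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")            # GPT-2 BPE tokenizer
eot = enc.eot_token                            # <|endoftext|>, used as a document delimiter
SHARD_TOKENS = 100_000_000                     # ~100M tokens per shard (~191MB as uint16)

ds = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")

buf = np.empty(SHARD_TOKENS, dtype=np.uint16)  # GPT-2 token ids all fit in uint16
filled, shard = 0, 0
for doc in ds:
    for tok in [eot] + enc.encode_ordinary(doc["text"]):
        buf[filled] = tok
        filled += 1
        if filled == SHARD_TOKENS:             # shard full: flush it and start the next one
            buf.tofile(f"fineweb_{shard:04d}.bin")
            shard += 1
            filled = 0
# a real script would also flush the last partial shard and parallelize tokenization

Each resulting .bin file is then just a flat array of token ids that the C side can mmap or fread directly, with no Python in the loop at train time.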

commented

FineWeb100B is 1010 files total; these are raw .bin shards of 100M tokens each:

  • Each is of size 191MB
  • Zipped, each is 134MB

134MB * 1010 files = 135340MB ~= 135GB
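
Those numbers are consistent with storing each GPT-2 token id as a 2-byte uint16; a quick back-of-the-envelope check:

tokens_per_shard = 100_000_000
bytes_per_token = 2                                 # GPT-2 vocab (50257) fits in uint16
print(tokens_per_shard * bytes_per_token / 2**20)   # ~190.7 MiB, matching ~191MB per raw shard
print(1010 * 134 / 1000)                            # ~135GB of zipped shards in total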

Have you played with the streaming parameter?
load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10", split='train', streaming=False, num_proc=28)
I was going to use it, but I have already downloaded 500GB of files.
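
For reference, the streaming variant of that call would look roughly like this (untested sketch): with streaming=True the dataset is iterated lazily over HTTP instead of being downloaded up front.

from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                  split="train", streaming=True)
for doc in ds:             # documents are yielded one at a time; nothing is downloaded up front
    text = doc["text"]
    # tokenize / feed the training pipeline here instead of reading local files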

commented

(I used streaming originally, but then started getting errors in the tokenization workers when a request would randomly fail, so I took it out.)

I do something like this: https://github.com/banyan-god/llama2.c/blob/master/finewebllama2.py. It's not very efficient since I'm encoding on the fly, but I'm planning to implement a thread that tokenizes and buffers the data so it's readily available.
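
One way that buffering thread could look, as a hypothetical sketch rather than the linked script: a producer thread tokenizes streamed documents into a bounded queue while the training loop consumes from it.

import queue
import threading

import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
buf = queue.Queue(maxsize=1024)    # bounded, so the producer can't run arbitrarily far ahead

def producer():
    ds = load_dataset("HuggingFaceFW/fineweb", name="CC-MAIN-2024-10",
                      split="train", streaming=True)
    for doc in ds:
        buf.put(enc.encode_ordinary(doc["text"]))   # blocks when the queue is full
    buf.put(None)                                   # sentinel: end of data

threading.Thread(target=producer, daemon=True).start()

while (tokens := buf.get()) is not None:
    pass   # consume `tokens` in the training step here instead of tokenizing inline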