mosaicml / llm-foundry

In the convert_dataset_hf.py script, the number of raw samples for C4 is set to 364_868_892:

llm-foundry/scripts/data_prep/convert_dataset_hf.py

Line 170 in 6afd446

raw_samples=364868892,

Where does this number come from? According to the Hugging Face Hub, the C4 train split has 2.21M rows. https://huggingface.co/datasets/c4

Edit: it seems this number is given in the Data Splits section: https://huggingface.co/datasets/c4#data-splits. So let me rephrase my question: how do we determine this number for a new dataset that we want to process with llm-foundry? For example, SlimPajama (https://huggingface.co/datasets/cerebras/SlimPajama-627B) doesn't state this number anywhere.

Hey Eldar, that hardcoded number is used solely for the progress bar in that script, so you can definitely proceed without defining it. I would have to dig a bit to figure out where that sample count came from, but hopefully that unblocks you.

Oh of course, this is definitely not a blocker. I just asked out of curiosity, in case someone knows offhand how to get these numbers.
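
For anyone who lands here with the same question: when a dataset card does not list a sample count, one option is simply to iterate over the split once and count. Below is a minimal sketch of that idea, not how the 364_868_892 figure in the script was actually obtained. The helper names `count_samples` and `count_hf_split` are hypothetical; the second one assumes the `datasets` package is installed and that the dataset can be opened with `load_dataset(..., streaming=True)`.

```python
from itertools import islice
from typing import Iterable, Optional


def count_samples(samples: Iterable, limit: Optional[int] = None) -> int:
    """Count items by iterating once; stop after `limit` items if given."""
    # islice(samples, None) iterates the whole input.
    return sum(1 for _ in islice(samples, limit))


def count_hf_split(path: str, split: str = "train",
                   name: Optional[str] = None) -> int:
    """Hypothetical helper: stream a Hub dataset split and count its rows.

    Needs the `datasets` package and network access. Streaming avoids
    materializing the data locally, but the whole split is still read
    once, so this can take a long time on a very large corpus.
    """
    # Imported lazily so count_samples stays usable without `datasets`.
    from datasets import load_dataset

    stream = load_dataset(path, name=name, split=split, streaming=True)
    return count_samples(stream)
```

Usage would look like `count_hf_split("cerebras/SlimPajama-627B")`. Since the count only feeds the progress bar, an approximate number (or none at all) is enough for the conversion itself.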

Wrong number of samples for C4?