mosaicml / llm-foundry

LLM training code for Databricks foundation models

Home Page: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

Wrong number of samples for C4?

eldarkurtic opened this issue

In the convert_dataset_hf.py script, the number of raw samples for C4 is set to 364_868_892.

Where does this number come from? According to the Hugging Face Hub, the C4 train split has 2.21M rows: https://huggingface.co/datasets/c4

Edit: it seems this number is given in the Data Splits section (https://huggingface.co/datasets/c4#data-splits). So let me rephrase my question: how do we determine this number for a new dataset that we want to process with llmfoundry? For example, SlimPajama (https://huggingface.co/datasets/cerebras/SlimPajama-627B) doesn't state this number anywhere.

Hey Eldar, that hardcoded number is used solely for the progress bar in that script, so you can definitely proceed without defining it. I would have to dig a bit to figure out where that sample count came from, but hopefully that unblocks you.

Oh, of course, this is definitely not a blocker. I just asked out of curiosity, in case someone knows offhand how to get these numbers.
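
For anyone landing here with the same question: when a dataset card does not publish a sample count, one fallback is simply to iterate over the split once and count. The sketch below is only an illustration, not how the number in convert_dataset_hf.py was derived (the thread does not say). The helper name count_samples and its limit parameter are hypothetical. The commented usage assumes the Hugging Face datasets library is installed and that the dataset can be streamed; a full pass over a dataset the size of SlimPajama would take a long time.

```python
from typing import Iterable, Optional


def count_samples(samples: Iterable, limit: Optional[int] = None) -> int:
    """Count items in any iterable by walking it exactly once.

    `limit` (hypothetical convenience parameter) stops the scan early,
    which is handy for a quick sanity check on a very large dataset.
    """
    count = 0
    for _ in samples:
        count += 1
        if limit is not None and count >= limit:
            break
    return count


# Possible usage with a streamed Hugging Face dataset (assumption: the
# `datasets` package is installed and network access is available):
#
#   from datasets import load_dataset
#   ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
#   print(count_samples(ds, limit=1_000_000))  # drop `limit` for a full count
```

Since the count only feeds the progress bar, an approximate or missing value should not affect the converted output.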