karpathy / llama2.c

Inference Llama 2 in one file of pure C

Train/val split

DavidHerel opened this issue

Hi,

I'd like to ask how to split a dataset into train/val splits. In tinystories.py, I don't quite understand this comment:

train/test split. let's use only shard 0 for test split, rest train

So how many tokens from the training data are selected for the validation split?

It seems that @karpathy uses 10 shards. If only shard 0 is used as the test split, does that mean 1/10 of the data is held out as the test set?
E.g., if I have a dataset with 10B tokens, are 1B tokens used for the test/val set?
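
To make the arithmetic in the question concrete, here is a minimal sketch of a shard-level split, assuming the shards are files whose sorted order puts shard 0 first. The helper names `split_shards` and `held_out_fraction` and the file names are hypothetical, chosen for illustration; they are not taken from tinystories.py.

```python
def split_shards(shard_filenames, split):
    """Hold out the first shard (in sorted order) for validation, train on the rest."""
    # Sort so that "shard 0" is well defined. Zero-padded names matter here:
    # lexicographically, "shard_10" would sort before "shard_2".
    shard_filenames = sorted(shard_filenames)
    if split == "train":
        return shard_filenames[1:]
    return shard_filenames[:1]


def held_out_fraction(tokens_per_shard):
    """Fraction of all tokens that live in shard 0 (the held-out shard)."""
    return tokens_per_shard[0] / sum(tokens_per_shard)


# Hypothetical layout: 10 shards of equal size (1B tokens each, 10B in total).
shards = [f"data{i:02d}.bin" for i in range(10)]
train_shards = split_shards(shards, "train")
val_shards = split_shards(shards, "val")
fraction = held_out_fraction([1_000_000_000] * 10)
```

Under these assumptions, `val_shards` holds one file, `train_shards` holds the other nine, and `fraction` comes out to one tenth, i.e. 1B of 10B tokens. The ratio is exactly 1/N only when all N shards contain the same number of tokens; with uneven shards, the held-out share is whatever shard 0 happens to contain.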