Question about the pre-training dataset.
TLB-MISS opened this issue
Sangwon Beak commented
- The huggingface bookcorpus dataset I downloaded is about 4.6 GB, and the wikipedia_en dataset is about 19 GB. Are these sizes correct for the datasets you used for general distillation? Could you tell me the size of each dataset? (A sketch of how I load them is below.)
- How did you split the general distillation dataset: into train/validation/test sets, into train/test sets, or did you use the whole set for general training?
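For context, this is roughly how I load and size-check the two corpora with the Hugging Face `datasets` library. The hub IDs (`bookcorpus`, and `wikipedia` with the `20220301.en` config) and the split call are my own assumptions, so please correct me if you used different sources or a different setup:

```python
# Minimal sketch, assuming the public Hugging Face hub datasets
# "bookcorpus" and "wikipedia" (config "20220301.en"); these may not be
# the exact corpora used in this repo.
from datasets import load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Size reported by the datasets library, in GB.
print(f"bookcorpus: {bookcorpus.info.dataset_size / 1e9:.1f} GB")
print(f"wikipedia:  {wiki.info.dataset_size / 1e9:.1f} GB")

# If a held-out set were needed, a simple split could look like this
# (the 1% test fraction here is only an example, not your setting):
split = bookcorpus.train_test_split(test_size=0.01, seed=42)
train_set, eval_set = split["train"], split["test"]
```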
Thanks