huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Question about the pre-training dataset.

TLB-MISS opened this issue

  1. The Hugging Face bookcorpus dataset I downloaded is about 4.6 GB, and the wikipedia_en dataset is about 19 GB. Are these the sizes of the datasets you used for general distillation? Could you tell me the size of each dataset? (A sketch for checking these sizes follows after this list.)

  2. How did you split the dataset for general distillation? Did you split it into train/validation/test sets, into train/test sets, or did you use the whole corpus for training?
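
For reference, here is a minimal sketch (not the authors' script) of how I checked the dataset sizes and how a split could be made with the Hugging Face `datasets` library. The dataset names/configs (`bookcorpus`, `wikipedia`/`20220301.en`) and the 1% validation split are my own assumptions, not something taken from this repo.

```python
# Sketch: check the on-disk size of the two pre-training corpora and
# demonstrate one possible train/validation split with `datasets`.
from datasets import load_dataset

# Assumed dataset names/configs; not necessarily the exact ones used for general distillation.
bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Size in bytes of the processed Arrow data, roughly comparable to the
# 4.6 GB / 19 GB figures quoted above.
print(f"bookcorpus: {bookcorpus.dataset_size / 1e9:.1f} GB")
print(f"wikipedia:  {wiki.dataset_size / 1e9:.1f} GB")

# One way to hold out a small validation set; whether the authors did this
# (or trained on the whole corpus) is exactly what question 2 asks.
split = bookcorpus.train_test_split(test_size=0.01, seed=42)
train_set, valid_set = split["train"], split["test"]
print(len(train_set), len(valid_set))
```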

Thanks