huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Question about the pre-training dataset.

TLB-MISS opened this issue

  1. The Hugging Face bookcorpus dataset I downloaded is about 4.6 GB, and the wikipedia_en dataset is about 19 GB. Are these the sizes of the datasets you used for general distillation? Could you tell me the size of each dataset? (A sketch for checking these sizes follows after this list.)

  2. How did you split the dataset for general distillation? Did you split it into train/validation/test sets, into train/test sets, or did you use the whole corpus for training?
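
For reference, here is a minimal sketch (not the authors' script) of how I checked the dataset sizes and how a split could be made with the Hugging Face `datasets` library. The dataset names/configs (`bookcorpus`, `wikipedia`/`20220301.en`) and the 1% validation split are my own assumptions, not something taken from this repo.

```python
# Sketch: check the on-disk size of the two pre-training corpora and
# demonstrate one possible train/validation split with `datasets`.
from datasets import load_dataset

# Assumed dataset names/configs; not necessarily the exact ones used for general distillation.
bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Size in bytes of the processed Arrow data, roughly comparable to the
# 4.6 GB / 19 GB figures quoted above.
print(f"bookcorpus: {bookcorpus.dataset_size / 1e9:.1f} GB")
print(f"wikipedia:  {wiki.dataset_size / 1e9:.1f} GB")

# One way to hold out a small validation set; whether the authors did this
# (or trained on the whole corpus) is exactly what question 2 asks.
split = bookcorpus.train_test_split(test_size=0.01, seed=42)
train_set, valid_set = split["train"], split["test"]
print(len(train_set), len(valid_set))
```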

Thanks