dolly/RoBERTa Corpus dataset download
AInkCode opened this issue · comments
How can I download the Dolly and RoBERTa Corpus datasets?
Please share the URLs. Thanks~
Dolly: https://huggingface.co/datasets/databricks/databricks-dolly-15k
RoBERTa Corpus is a combination of multiple sources:
https://huggingface.co/datasets/wikicorpus
https://huggingface.co/datasets/bookcorpus
https://huggingface.co/datasets/cc_news
https://huggingface.co/datasets/Skylion007/openwebtext
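For anyone who wants to pull these programmatically, here is a minimal sketch, not the maintainers' actual pipeline. It assumes the Hugging Face `datasets` package is installed; the helper names `repo_id_from_url` and `load_streaming` are hypothetical, and the `"train"` split is an assumption that may not hold for every dataset.

```python
from urllib.parse import urlparse

# URLs exactly as listed in this thread.
DATASET_URLS = [
    "https://huggingface.co/datasets/databricks/databricks-dolly-15k",
    "https://huggingface.co/datasets/wikicorpus",
    "https://huggingface.co/datasets/bookcorpus",
    "https://huggingface.co/datasets/cc_news",
    "https://huggingface.co/datasets/Skylion007/openwebtext",
]


def repo_id_from_url(url: str) -> str:
    """Hypothetical helper: turn a Hub dataset URL into its dataset ID."""
    path = urlparse(url).path  # e.g. "/datasets/databricks/databricks-dolly-15k"
    prefix = "/datasets/"
    if not path.startswith(prefix):
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return path[len(prefix):].strip("/")


def load_streaming(url: str, config_name=None, split: str = "train"):
    """Hypothetical helper: open a dataset in streaming mode (no full download).

    Assumes a split called "train" exists; pass config_name for datasets
    that define several configurations.
    """
    from datasets import load_dataset  # third-party; imported lazily

    return load_dataset(
        repo_id_from_url(url), name=config_name, split=split, streaming=True
    )


REPO_IDS = [repo_id_from_url(u) for u in DATASET_URLS]

if __name__ == "__main__":
    for repo_id in REPO_IDS:
        print(repo_id)
```

Calling `load_streaming(DATASET_URLS[0])` should return an iterable over Dolly records. Some of the other datasets may need a configuration name or extra loading options depending on the installed `datasets` version, so check each dataset card first.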
The RoBERTa Corpus is a combination of multiple sources. Did you not apply any filtering at all?
The bookcorpus dataset alone has 74M rows, but your RoBERTa folder is named 20M. What rules did you use to filter the final data? I would appreciate a detailed description, or, if possible, a public release of your RoBERTa training data. Thank you for your help.