microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs

Home Page: https://aka.ms/GeneralAI

dolly/RoBERTa Corpus dataset download

AInkCode opened this issue

How can I download the dolly and RoBERTa Corpus datasets?
Could you please share the URLs? Thanks~

The RoBERTa Corpus is a combination of multiple sources. Was any filtering applied to it?
The bookcorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I would appreciate a detailed description or, if possible, a public release of your RoBERTa training data. Thank you for your help.