EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

Home Page: https://www.eleuther.ai/

Some datasets are not available

vangogh0318 opened this issue

Describe the bug
I cannot download the GitHub and ArXiv datasets; the URLs appear to be wrong.
How can I download the GitHub and ArXiv data? Thank you.

The code in the corpora.py file, line 190:
class Github(DataDownloader):
    name = "github"
    urls = ["http://eaidata.bmk.sh/data/github_small.jsonl.zst"]

class ArXiv(DataDownloader):
    name = "arxiv"
    urls = [
        "https://the-eye.eu/public/AI/pile_preliminary_components/2020-09-08-arxiv-extracts-nofallback-until-2007-068.tar.gz"
    ]
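
To confirm which of these URLs are actually dead before starting a long download, a small reachability check can help. The following is only a minimal sketch using the Python standard library. It is not part of corpora.py, and the names `check_urls` and `open_fn` are hypothetical, introduced here for illustration.

```python
import urllib.error
import urllib.request


def check_urls(urls, open_fn=urllib.request.urlopen, timeout=10):
    """Return a dict mapping each URL to an HTTP status code or an error string.

    `open_fn` is a hypothetical injection point (defaults to urlopen) so the
    function can be exercised without network access.
    """
    results = {}
    for url in urls:
        # A HEAD request avoids downloading the (potentially very large) archive.
        request = urllib.request.Request(url, method="HEAD")
        try:
            with open_fn(request, timeout=timeout) as response:
                results[url] = response.status
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 404.
            results[url] = err.code
        except (urllib.error.URLError, OSError) as err:
            # DNS failure, refused connection, timeout, and similar.
            results[url] = f"unreachable: {err}"
    return results
```

Calling `check_urls(Github.urls + ArXiv.urls)` would then report, per URL, either a status code or an "unreachable" message. Some servers reject HEAD requests, so a non-200 result is a hint, not proof, that the file is gone.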

This is correct. The Pile has been taken down due to a DMCA takedown request.

Hi, how can I access the Pile data? Thanks.
@StellaAthena