EleutherAI / the-pile


Pile-CC Size

KeremTurgutlu opened this issue

I am writing a data pipeline to process Common Crawl and am referencing your code in the pile-cc repo. In this repo, the raw version of Pile-CC is listed at 200 GB; however, in the pile-cc repo we see the following:

> 3.5PB of network ingress in total is required. The final dataset should be (warning: this number is very rough and extrapolated; leave some slack space to be safe!) about 200TB. About 40k core days (non-hyperthreaded) are also required (again, a very rough estimate from extrapolation).

I am a bit confused about the difference between 200TB and 200GB. Was there another pipeline that reduced the size from 200TB to 200GB? If so, I am not able to find it. Thanks!
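
For context, the pipeline I am putting together looks roughly like the sketch below. This is my own illustrative code, not taken from pile-cc; the WET shard URL is just a placeholder, and I am assuming `warcio` for record parsing. It streams one Common Crawl WET shard and reads the extracted plain text of each record, which is the stage whose output size I am trying to budget for.

```python
# Illustrative sketch only: stream one Common Crawl WET shard and yield the
# extracted text of each record. The URL below is a placeholder, not a real shard.
import requests
from warcio.archiveiterator import ArchiveIterator

WET_URL = "https://data.commoncrawl.org/crawl-data/.../wet/....warc.wet.gz"  # placeholder

def iter_wet_texts(url):
    # Stream the gzipped WET file; ArchiveIterator handles the gzip framing.
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    for record in ArchiveIterator(resp.raw):
        # WET files store the extracted page text as "conversion" records.
        if record.rec_type == "conversion":
            yield record.content_stream().read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    # Peek at the first few records to get a feel for per-record text sizes.
    for i, text in enumerate(iter_wet_texts(WET_URL)):
        print(len(text))
        if i >= 10:
            break
```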