EleutherAI / the-pile

Reducing download size

marionbartl opened this issue

Hi! I would like to create a subset of the Pile that is ~5 GB in size. The subset should preserve the original proportions of the component datasets, and the documents included should be randomly sampled from each dataset.
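To make the goal concrete, here is a rough sketch of the sampling I have in mind. It is not taken from this repo: `sample_subset` is a hypothetical helper, the input is assumed to be a stream of `(dataset_name, text)` pairs, and the total size of the stream is assumed to be known in advance. Keeping every document with the same probability should preserve the dataset proportions in expectation.

```python
import random


def sample_subset(docs, total_bytes, target_bytes, seed=0):
    """Yield a uniform random sample of documents.

    docs         -- iterable of (dataset_name, text) pairs (assumed format)
    total_bytes  -- size of the full stream, assumed known in advance
    target_bytes -- desired size of the subset (reached only approximately)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep_prob = min(1.0, target_bytes / total_bytes)
    for name, text in docs:
        # The same keep probability for every document means each dataset
        # shrinks by the same factor, so the mix stays the same on average.
        if rng.random() < keep_prob:
            yield name, text
```

Note that this still reads every document once, so on its own it would not reduce the download size; it only describes the subset I am trying to end up with.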

I tried using the --limit, --read_amount, and --make_dataset_samples parameters to reduce the download size, but when I run the script, each dataset is still downloaded at its original size.

I would greatly appreciate it if you could tell me whether what I'm looking for is achievable with this repo and what the command for that would be.

Thanks!