Reducing download size
marionbartl opened this issue · comments
Marion Bartl commented
Hi! I would like to create a subset of the Pile that is ~5 GB in size. The subset should preserve the original proportions of the component datasets, and the documents within each dataset should be randomly sampled.
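To make the requirement concrete, here is a minimal sketch of the kind of sampling I have in mind. It is not code from this repo: the function name `sample_subset`, the in-memory `datasets` dict, and the byte-based budget are all assumptions made for illustration only.

```python
import random


def sample_subset(datasets, target_bytes, seed=0):
    """Randomly sample documents so each dataset keeps its original share.

    `datasets` is a hypothetical dict mapping a dataset name to a list of
    document strings; a real pipeline would stream from disk instead.
    """
    rng = random.Random(seed)

    # Size of each dataset in UTF-8 bytes, used to derive its share.
    sizes = {
        name: sum(len(doc.encode("utf-8")) for doc in docs)
        for name, docs in datasets.items()
    }
    total = sum(sizes.values())
    if total == 0:
        return {name: [] for name in datasets}

    subset = {}
    for name, docs in datasets.items():
        # Byte budget proportional to the dataset's share of the total.
        budget = target_bytes * sizes[name] / total

        # Shuffle a copy so the caller's lists are left untouched.
        order = list(docs)
        rng.shuffle(order)

        picked, used = [], 0
        for doc in order:
            n = len(doc.encode("utf-8"))
            if used + n > budget:
                break  # stop once the next document would exceed the budget
            picked.append(doc)
            used += n
        subset[name] = picked
    return subset
```

For a ~5 GB subset, `target_bytes` would be roughly `5 * 1024**3`. Note that this sketch needs the full data to be available locally, which is exactly the download I am hoping to avoid.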
I tried to work with the --limit, --read_amount, and --make_dataset_samples parameters to reduce the download size, but when I run the script, each dataset is still downloaded at its original size.
I would greatly appreciate it if you could tell me whether what I'm looking for is achievable with this repo and what the command for that would be.
Thanks!