Add the option of saving in parquet instead of arrow
arita37 opened this issue · comments
Feature request
Add an option to dataset.save_to_disk('/path/to/save/dataset') to save in Parquet format, for example:
dataset.save_to_disk('/path/to/save/dataset', format="parquet")
Motivation
Arrow is not used for production big data workloads (only Parquet is).
Your contribution
I can do the testing!
I think Dataset.to_parquet is what you're looking for. Let me know if I'm wrong.
You can use to_parquet to save the data, and ds.info.write_to_directory() to save the dataset info.
Yes, and there is DatasetInfo.from_directory() to reload the info.
load_dataset doesn't load the dataset in memory: it progressively writes to disk in Arrow format and then memory-maps the Arrow files. This makes it possible to load datasets bigger than memory without filling your RAM.