huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Add the option of saving in parquet instead of arrow

arita37 opened this issue · comments

commented

Feature request

In dataset.save_to_disk('/path/to/save/dataset'), add an option to save in Parquet format, for example:

dataset.save_to_disk('/path/to/save/dataset', format="parquet")

because Arrow is not used for production big data (only Parquet is).

Motivation

Arrow is not used for production big data (only Parquet is).

Your contribution

I can do the testing!

I think Dataset.to_parquet is what you're looking for.

Let me know if I'm wrong.

commented

You can use to_parquet to save the data and ds.info.write_to_directory() to save the dataset info.

commented

Yes, and there is DatasetInfo.from_directory() to reload the info.

commented

load_dataset doesn't load the dataset into memory. It progressively writes it to disk in Arrow format and then memory-maps the Arrow files. This lets you load datasets bigger than memory without filling your RAM.
