huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Home Page: https://huggingface.co/docs/datasets

Avoid downloading the whole dataset when only README.md has been touched on the Hub.

zinc75 opened this issue

Feature request

datasets.load_dataset() triggers a new download of the whole dataset when the README.md file has been touched on the Hugging Face Hub, even if the data files / parquet files are exactly the same.

I think the current behaviour of the load_dataset function is triggered whenever the hash of the latest commit on the Hugging Face Hub changes. Is there a clever way to download the dataset again if and only if the data itself has been modified?
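To make the idea concrete, here is a minimal sketch of a data-only check. It is not how datasets currently decides what to download; it only illustrates comparing a fingerprint of the data files instead of the commit hash. The helper names (`data_fingerprint`, `needs_redownload`) and the `NON_DATA_FILES` set are hypothetical, and the sketch assumes each revision can be described as a list of (path, blob id) pairs, for example gathered from the Hub's file metadata.

```python
import hashlib
from typing import Iterable, Tuple

# Hypothetical set of files assumed not to affect the loaded data.
NON_DATA_FILES = {"README.md", ".gitattributes"}


def data_fingerprint(files: Iterable[Tuple[str, str]]) -> str:
    """Hash the (path, blob_id) pairs of one revision, skipping non-data files."""
    digest = hashlib.sha256()
    for path, blob_id in sorted(files):
        if path in NON_DATA_FILES:
            continue
        digest.update(f"{path}\0{blob_id}\n".encode("utf-8"))
    return digest.hexdigest()


def needs_redownload(
    cached_files: Iterable[Tuple[str, str]],
    latest_files: Iterable[Tuple[str, str]],
) -> bool:
    """Return True only if a data file was added, removed, or changed."""
    return data_fingerprint(cached_files) != data_fingerprint(latest_files)


# Illustrative file listings (made-up blob ids): only README.md differs.
cached = [("README.md", "aaa"), ("data/train-00000-of-00001.parquet", "p1")]
latest = [("README.md", "bbb"), ("data/train-00000-of-00001.parquet", "p1")]
print(needs_redownload(cached, latest))
```

With such a check, a commit that only edits README.md would leave the fingerprint unchanged, so the cached copy could be reused even though the commit hash differs.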

Motivation

The current behaviour is a waste of network bandwidth / disk space / research time.

Your contribution

I don't have time to submit a PR, but I hope a simple solution will emerge from this issue!

You're right. We're tackling this here: huggingface/dataset-viewer#2757

@severo: great!