Avoid downloading the whole dataset when only README.md has been touched on the Hub.
zinc75 opened this issue
Feature request
datasets.load_dataset() triggers a new download of the whole dataset when the README.md file has been touched on the Hugging Face Hub, even if the data files / parquet files are exactly the same.
I think load_dataset currently re-downloads whenever the hash of the latest commit on the Hugging Face Hub changes, but is there a clever way to download the dataset again if and only if the data is modified?
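To make the request concrete, here is a minimal sketch of the kind of check being asked for: compare per-file content hashes between the cached revision and the latest revision, instead of comparing commit hashes. Everything in it is an assumption for illustration: the helper name data_files_changed, the NON_DATA_FILES set, and the idea that a path-to-hash mapping is available for both revisions. It is not how datasets currently decides what to download.

```python
from typing import Mapping

# Hypothetical list of repository files assumed not to affect the data itself.
NON_DATA_FILES = frozenset({"README.md", ".gitattributes"})


def data_files_changed(cached: Mapping[str, str], latest: Mapping[str, str]) -> bool:
    """Return True if any data file was added, removed, or modified.

    Both arguments map a file path in the repository to a content hash
    (assumed to be available for each revision; how they are obtained is
    out of scope for this sketch).
    """
    cached_data = {p: h for p, h in cached.items() if p not in NON_DATA_FILES}
    latest_data = {p: h for p, h in latest.items() if p not in NON_DATA_FILES}
    return cached_data != latest_data


# Only README.md differs between the two revisions: no re-download needed.
cached = {"README.md": "aaa", "data/train-00000.parquet": "h1"}
readme_only = {"README.md": "bbb", "data/train-00000.parquet": "h1"}
print(data_files_changed(cached, readme_only))  # False

# A parquet file differs: the data must be fetched again.
data_edit = {"README.md": "aaa", "data/train-00000.parquet": "h2"}
print(data_files_changed(cached, data_edit))  # True
```

One caveat with this sketch: the YAML header of README.md can describe configurations and data files, so a real implementation would probably need to check whether that metadata changed rather than ignoring the file entirely.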
Motivation
The current behaviour is a waste of network bandwidth / disk space / research time.
Your contribution
I don't have time to submit a PR, but I hope a simple solution will emerge from this issue!
you're right, we're tackling this here: huggingface/dataset-viewer#2757