stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

get_auto_dataset path logic does not work properly when dataset_id is a path

dtsip opened this issue · comments

Describe the bug
HuggingFace datasets allows you to specify a local directory as the dataset_id and will look for data files there. This means that, in the existing Mistral codebase, I can very easily load a custom dataset by setting the dataset_id in the config file to the local path.
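
For reference, a minimal sketch of that workflow (the local path is a placeholder, and this assumes a datasets version that supports loading a plain data directory):

```python
from datasets import load_dataset

# Loading from a local directory: datasets looks for data files there.
# "/data/my_corpus" is a placeholder path for illustration.
ds_local = load_dataset("/data/my_corpus")
```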

However, the way the get_auto_dataset function handles caching directories (e.g., here) does not take that into account. As a result, when dataset_id is an absolute path, the path concatenation resolves to the original dataset directory, and the preprocessing files end up being written there, which is undesirable.
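
To make the failure mode concrete, here is a rough sketch of the problematic pattern (variable names and paths are illustrative, not the exact Mistral code):

```python
import os

cache_dir = "/home/user/.cache/mistral"
dataset_id = "/data/my_corpus"  # absolute path instead of a hub ID

# os.path.join discards everything before an absolute path component,
# so the "cache" directory ends up being the source dataset directory.
dataset_cache = os.path.join(cache_dir, "datasets", dataset_id)
print(dataset_cache)  # -> /data/my_corpus
```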

To Reproduce
Provide an absolute path as dataset_id (e.g., here).

Expected behavior
The dataset should be loaded normally and any intermediate/temporary files should only be produced in caching_dir/datasets.

One way to handle this is to add some os.path.isdir(dataset_id) logic to get_auto_dataset, based on how HF caches datasets coming from local paths. If this makes sense, I can try submitting a PR.
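
A possible shape for that check, as a rough sketch (the helper name and cache layout here are assumptions, not the actual patch):

```python
import os

def resolve_dataset_cache(cache_dir, dataset_id):
    # Hypothetical helper illustrating the proposed fix.
    if os.path.isdir(dataset_id):
        # Local path: derive a cache key from the directory name so that
        # preprocessing artifacts land under cache_dir/datasets instead of
        # inside the source dataset directory.
        key = os.path.basename(os.path.normpath(dataset_id))
    else:
        key = dataset_id
    return os.path.join(cache_dir, "datasets", key)
```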

thanks! i think it's probably simplest to just mangle the /'s to -'s. What do you think?
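
concretely, something like this (illustrative only):

```python
# "/data/my_corpus" -> "-data-my_corpus": no longer an absolute path,
# so os.path.join keeps it under cache_dir/datasets.
safe_key = dataset_id.replace("/", "-")
```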

I gave it a couple of tries (it would be clean if your dataset_id was absolute_path_to_local_file), but HuggingFace ends up storing things under something like json. Will try to see if there is a way around it.

sorry i'm not sure i follow that?

Oh, just that when loading from local files, HuggingFace sets dataset_id to be something like json (presumably based on what files it found). But Mistral assumes that the files are stored under dataset_id, which causes them not to be found. I will investigate how we can make the two interact better.
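
For illustration, roughly what happens in that case (the file path is a placeholder):

```python
from datasets import load_dataset

# Loading local JSON files goes through the generic "json" builder,
# so the cached Arrow files are keyed by "json", not by the original path.
ds = load_dataset("json", data_files="/data/my_corpus/train.json")
print(ds["train"].cache_files)  # cache paths live under a "json" directory
```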

ah i see. i wonder if we should just rely on HF's own caching logic rather than trying to manage this ourselves...

looking at it, i think we can probably get away with not passing in train_indices_cache_file_name and test_indices_cache_file_name, since they'll be autogenerated in the same cache_dir (I think)... Do you want to try that?
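
i.e., roughly something like the following (the toy dataset, split size, and seed are placeholders):

```python
from datasets import Dataset

# Toy dataset for illustration; in Mistral this would be the loaded corpus.
dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# Without train_indices_cache_file_name / test_indices_cache_file_name,
# datasets generates fingerprint-based cache files itself (per the comment
# above, they should land in the same cache_dir).
split = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]
```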

i fixed this for new-style datasets