stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

get_auto_dataset path logic does not work properly when dataset_id is a path

dtsip opened this issue · comments

Describe the bug
HuggingFace datasets allows you to specify a local directory as the dataset_id and will look for data files there. This means that, in the existing Mistral codebase, I can very easily load a custom dataset by setting the dataset_id in the config file to the local path.
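
For reference, a minimal sketch of that workflow (the local path is a placeholder, and this assumes a datasets version that supports loading a plain data directory):

```python
from datasets import load_dataset

# Loading from a local directory: datasets looks for data files there.
# "/data/my_corpus" is a placeholder path for illustration.
ds_local = load_dataset("/data/my_corpus")
```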

However, the way the get_auto_dataset function handles caching directories (e.g., here) does not take that into account. As a result, when dataset_id is an absolute path, the path concatenation resolves to the original dataset directory, and the preprocessing files end up being written there, which is undesirable.
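
To make the failure mode concrete, here is a rough sketch of the problematic pattern (variable names and paths are illustrative, not the exact Mistral code):

```python
import os

cache_dir = "/home/user/.cache/mistral"
dataset_id = "/data/my_corpus"  # absolute path instead of a hub ID

# os.path.join discards everything before an absolute path component,
# so the "cache" directory ends up being the source dataset directory.
dataset_cache = os.path.join(cache_dir, "datasets", dataset_id)
print(dataset_cache)  # -> /data/my_corpus
```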

To Reproduce
Provide an absolute path as dataset_id (e.g., here).

Expected behavior
The dataset should be loaded normally and any intermediate/temporary files should only be produced in caching_dir/datasets.

One way to handle this is to add some os.path.isdir(dataset_id) logic to get_auto_dataset, based on how HF caches datasets coming from local paths. If this makes sense, I can try submitting a PR.
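
A possible shape for that check, as a rough sketch (the helper name and cache layout here are assumptions, not the actual patch):

```python
import os

def resolve_dataset_cache(cache_dir, dataset_id):
    # Hypothetical helper illustrating the proposed fix.
    if os.path.isdir(dataset_id):
        # Local path: derive a cache key from the directory name so that
        # preprocessing artifacts land under cache_dir/datasets instead of
        # inside the source dataset directory.
        key = os.path.basename(os.path.normpath(dataset_id))
    else:
        key = dataset_id
    return os.path.join(cache_dir, "datasets", key)
```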

thanks! i think it's probably simplest to just mangle the /'s to -'s. What do you think?
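
concretely, something like this (illustrative only):

```python
# "/data/my_corpus" -> "-data-my_corpus": no longer an absolute path,
# so os.path.join keeps it under cache_dir/datasets.
safe_key = dataset_id.replace("/", "-")
```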

I gave it a couple of tries (it would be clean if your dataset_id was absolute_path_to_local_file), but HuggingFace ends up storing things under something like json. Will try to see if there is a way around it.

sorry i'm not sure i follow that?

Oh, just that when loading from local files, HuggingFace sets dataset_id to be something like json (presumably based on what files it found). But Mistral assumes that the files are stored under dataset_id, which causes them not to be found. I will investigate how we can make the two interact better.
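
For illustration, roughly what happens in that case (the file path is a placeholder):

```python
from datasets import load_dataset

# Loading local JSON files goes through the generic "json" builder,
# so the cached Arrow files are keyed by "json", not by the original path.
ds = load_dataset("json", data_files="/data/my_corpus/train.json")
print(ds["train"].cache_files)  # cache paths live under a "json" directory
```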

ah i see. i wonder if we should just rely on HF's own caching logic rather than trying to manage this ourselves...

looking at it, i think we can probably get away with not passing in train_indices_cache_file_name and test_indices_cache_file_name, since they'll be autogenerated in the same cache_dir (I think)... Do you want to try that?
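
i.e., roughly something like the following (the toy dataset, split size, and seed are placeholders):

```python
from datasets import Dataset

# Toy dataset for illustration; in Mistral this would be the loaded corpus.
dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# Without train_indices_cache_file_name / test_indices_cache_file_name,
# datasets generates fingerprint-based cache files itself (per the comment
# above, they should land in the same cache_dir).
split = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]
```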

i fixed this for new-style datasets