EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval

Home Page: https://lmms-lab.github.io/lmms-eval-blog/

How to specify dataset paths?

Jiahaohong opened this issue

Hi, can you provide more details about your issue?

If you mean the dataset paths for Hugging Face datasets, you can change them in the yaml file of each task.
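
For example (a sketch assuming the lmms_eval/tasks/<task>/<task>.yaml layout mentioned later in this thread; adjust the filename to your checkout), you can locate the key for the mme task like this:

# show where the Hugging Face dataset path is configured for the mme task
grep -n "dataset_path" lmms_eval/tasks/mme/mme.yaml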

Thanks. But if I want to use local datasets, how should I change the command, and how should the datasets be organized?

I think as long as you arrange your local dataset in the Hugging Face format and set dataset_path to point to it, it should work fine.
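
One quick sanity check (a sketch; /data/mme_local is a hypothetical path) is to confirm that the datasets library can load your local copy before pointing dataset_path at it:

# if this fails, the harness will fail to load the dataset in the same way
python -c "from datasets import load_dataset; print(load_dataset('/data/mme_local'))"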

Where is the yaml file?

I just run python -m lmms_eval --model llava --model_args pretrained="./checkpoints/llava-qwen-4b-finetune-490/checkpoint-1500/,conv_template=qwen" --tasks mme --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/

I think it would be better to be able to specify a root local dataset path (we might want to download the datasets needed for evaluation ourselves).

Many servers can't connect to the internet and only share disks with a development machine, so loading from HF shouldn't be a forced choice.

@lucasjinreal, the yaml file is in the folder of each task, for example lmms_eval/tasks/ai2d/ai2d.yaml.

If you don't want to change the dataset_path one by one, you can download all the datasets first on a machine outside the server using the command lmms_eval --tasks list_with_num, and then upload the cache folder to your server disk. You can then export HF_HOME to point at your server cache folder and reuse the cached datasets each time.
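
Put together, the transfer workflow might look like this (a sketch; hostnames and paths are hypothetical):

# on a machine with internet access: point the HF cache somewhere convenient
export HF_HOME=/tmp/hf_cache
# as described above, this downloads every dataset in the repo into the cache
lmms_eval --tasks list_with_num
# copy the cache onto the offline server's shared disk
rsync -a /tmp/hf_cache/ user@offline-server:/data/hf_cache/
# then, on the offline server, reuse the cache
export HF_HOME=/data/hf_cache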

@kcz358 Hi, is it possible to pass a dataset_root via command-line args, so that this is not so cumbersome?

You can set the environment variable to a folder and place all your offline datasets in that folder.

It will be something like:

export HF_HOME=/user/boli01/workspace/.cache/huggingface

And your offline datasets would be laid out like:

/user/boli01/workspace/.cache/huggingface/dataset_A

How can I download the dataset offline, using git clone directly into the cache folder?

It looks like it cannot find it properly.

I still get this error:

huggingface_hub.utils._headers.LocalTokenNotFoundError: Token is required (token=True), but no token found. You need to provide a token or be logged in to Hugging Face with huggingface-cli login or huggingface_hub.login. See https://huggingface.co/settings/tokens.

I cannot create any tokens on the server machine.

It would still be better to provide a dataset_root path through the main entry point, so that users can download the data on their own.

@lucasjinreal May I ask whether you are encountering this error during evaluation or while downloading the dataset?

So far, our suggestion is to download the datasets you need on a machine that can access Hugging Face and then transfer them to your server.

If you want to specify a dataset root on your local filesystem, you can do so with export HF_HOME=<your dataset dir>

You can also try setting token: False in the task's yaml file. As long as you can make sure your local dataset path can be loaded using load_dataset, everything should be fine.

@kcz358 Hi, I set the path to my local folder and downloaded, for example, the MME data into it, but the eval doesn't read from it at all.

I have already set the environment variables.

How do I set token to false? I didn't see such a key in the yaml.

Still, I would recommend that the task path could be set directly to a local one and read from it when available, rather than asking to download data that is already there.

@lucasjinreal, you should make sure your path can be loaded through the load_dataset method.

In your yaml file, you can set

dataset_kwargs:
  token: False
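
To check the same thing outside the harness (a sketch assuming dataset_kwargs are forwarded to load_dataset and a recent version of the datasets library; the dataset name is illustrative):

# token=False mirrors the dataset_kwargs setting above
python -c "from datasets import load_dataset; load_dataset('lmms-lab/MME', token=False)"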

And for your last recommendation, we have already suggested multiple times that you can control your cache folder path with export HF_HOME=<your cache folder>. You just need to download the data using HF load_dataset outside and upload it to your server. HF will manage the dataset structure for you, so you don't need to change it. Running the command lmms_eval --tasks list_with_num will download every dataset in our repo into the cache folder of the external machine, and you just need to upload that cache folder to your server.

And again, if you want to change your dataset path, you should check whether the path can be loaded by the load_dataset method.