EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval

Home Page: https://lmms-lab.github.io/lmms-eval-blog/

How to specify dataset paths?

Jiahaohong opened this issue

Hi, can you provide more details about your issue?

If you mean the dataset paths for Hugging Face datasets, you can change them in the yaml file of each task.
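
For example (a sketch assuming the lmms_eval/tasks/<task>/<task>.yaml layout mentioned later in this thread; adjust the filename to your checkout), you can locate the key for the mme task like this:

# show where the Hugging Face dataset path is configured for the mme task
grep -n "dataset_path" lmms_eval/tasks/mme/mme.yaml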

Thanks. But if I want to use local datasets, how should I change the command, and how should the datasets be organized?

I think as long as you arrange your local dataset in the Hugging Face format and set dataset_path to point to it, it should work fine.
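
One quick sanity check (a sketch; /data/mme_local is a hypothetical path) is to confirm that the datasets library can load your local copy before pointing dataset_path at it:

# if this fails, the harness will fail to load the dataset in the same way
python -c "from datasets import load_dataset; print(load_dataset('/data/mme_local'))"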

Where is the yaml file?

I just run python -m lmms_eval --model llava --model_args pretrained="./checkpoints/llava-qwen-4b-finetune-490/checkpoint-1500/,conv_template=qwen" --tasks mme --batch_size 1 --log_samples --log_samples_suffix llava_v1.5_mme --output_path ./logs/

I think it would be better to be able to specify a root local dataset path (we might want to download the datasets needed for evaluation ourselves).

Many servers can't connect to the internet and only share disks with a development machine, so loading from HF shouldn't be a forced choice.

@lucasjinreal, the yaml file is in the folder of each task, for example lmms_eval/tasks/ai2d/ai2d.yaml.

If you don't want to change the dataset_path one by one, you can download all the datasets first on a machine outside the server using the command lmms_eval --tasks list_with_num, and then upload the cache folder to your server disk. You can then export HF_HOME to point at your server cache folder and reuse the cached datasets each time.
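
Put together, the transfer workflow might look like this (a sketch; hostnames and paths are hypothetical):

# on a machine with internet access: point the HF cache somewhere convenient
export HF_HOME=/tmp/hf_cache
# as described above, this downloads every dataset in the repo into the cache
lmms_eval --tasks list_with_num
# copy the cache onto the offline server's shared disk
rsync -a /tmp/hf_cache/ user@offline-server:/data/hf_cache/
# then, on the offline server, reuse the cache
export HF_HOME=/data/hf_cache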

@kcz358 Hi, is it possible to pass a dataset_root via command-line args, so that this is not so cumbersome?

You can set the environment variable to a folder and place all your offline datasets in that folder.

It will be something like:

export HF_HOME=/user/boli01/workspace/.cache/huggingface

And your offline datasets would be laid out like:

/user/boli01/workspace/.cache/huggingface/dataset_A

How can I download the dataset offline, using git clone directly into the cache folder?

It looks like it cannot find it properly.

I still get this error:

huggingface_hub.utils._headers.LocalTokenNotFoundError: Token is required (token=True), but no token found. You need to provide a token or be logged in to Hugging Face with huggingface-cli login or huggingface_hub.login. See https://huggingface.co/settings/tokens.

I cannot create any tokens on the server machine.

It would still be better to provide a dataset_root path through the main entry point, so that users can download the data on their own.

@lucasjinreal May I ask whether you are encountering this error during evaluation or while downloading the dataset?

So far, our suggestion is to download the datasets you need on a machine that can access Hugging Face and then transfer them to your server.

If you want to specify a dataset root on your local filesystem, you can do so with export HF_HOME=<your dataset dir>

You can also try setting token: False in the task's yaml file. As long as you can make sure your local dataset path can be loaded using load_dataset, everything should be fine.

@kcz358 Hi, I set the path to my local folder and downloaded, for example, the MME data into it, but the eval doesn't read from it at all.

I have already set the environment variables.

How do I set token to false? I didn't see such a key in the yaml.

Still, I would recommend that the task path could be set directly to a local one and read from it when available, rather than asking to download data that is already there.

@lucasjinreal, you should make sure your path can be loaded through the load_dataset method.

In your yaml file, you can set

dataset_kwargs:
  token: False
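
To check the same thing outside the harness (a sketch assuming dataset_kwargs are forwarded to load_dataset and a recent version of the datasets library; the dataset name is illustrative):

# token=False mirrors the dataset_kwargs setting above
python -c "from datasets import load_dataset; load_dataset('lmms-lab/MME', token=False)"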

And for your last recommendation, we have already suggested multiple times that you can control your cache folder path with export HF_HOME=<your cache folder>. You just need to download the data using HF load_dataset outside and upload it to your server. HF will manage the dataset structure for you, so you don't need to change it. Running the command lmms_eval --tasks list_with_num will download every dataset in our repo into the cache folder of the external machine, and you just need to upload that cache folder to your server.

And again, if you want to change your dataset path, you should check whether the path can be loaded by the load_dataset method.