mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/


Tried to evaluate the model on a local-network-only machine

zwsjink opened this issue

Well, I first use

python download_evalsets.py $download_dir

to download all the necessary datasets on an internet-accessible machine and then migrate the data to my machine with limited internet access.
All the other evaluations went well except for the retrieval datasets, which use the hf_cache/ directory instead.

The error goes like this:

>>> datasets.load_dataset("nlphuji/flickr_1k_test_image_text_retrieval",split="test", cache_dir=os.path.join("/mnt/data/datacom2023/evaluate_datasets", "hf_cache"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 2129, in load_dataset
    builder_instance = load_dataset_builder(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1815, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1512, in dataset_module_factory
    raise e1 from None
  File "/root/anaconda3/envs/datacomp/lib/python3.10/site-packages/datasets/load.py", line 1468, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({type(e).__name__})")
ConnectionError: Couldn't reach 'nlphuji/flickr_1k_test_image_text_retrieval' on the Hub (ConnectTimeout)

It seems like the Hugging Face datasets module is still trying to connect to the internet. Is there any trick I can play to skip the connection to Hugging Face? The evaluation command:

python evaluate.py --train_output_dir /mnt/data/datacomp2023/train_output/basic_train --data_dir /mnt/data/datacomp2023/evaluate_datasets

Hi, thanks for bringing this up! I assumed that the HF datasets would work properly without an internet connection, because the download_evalsets.py script already loads them once to put them in the cache. I'll look into potential solutions to this issue.

Can you try setting the environment variable HF_DATASETS_OFFLINE to 1? (from https://huggingface.co/docs/datasets/v2.14.5/en/loading#offline)
It seems that even if the dataset is cached, HF will check the online version by default, so hopefully this should fix things.
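
For example, here is a minimal sketch of both options, assuming the data layout from the commands above (the cache path is illustrative). The flag appears to be read when datasets is imported, so in Python it has to be set before the import:

import os

# Set the offline flag before importing datasets; the library reads
# HF_DATASETS_OFFLINE at import time. (Equivalently, export it in the
# shell before running evaluate.py.)
os.environ["HF_DATASETS_OFFLINE"] = "1"

import datasets

# Illustrative path: the hf_cache directory populated by
# download_evalsets.py on the internet-accessible machine.
ds = datasets.load_dataset(
    "nlphuji/flickr_1k_test_image_text_retrieval",
    split="test",
    cache_dir="/mnt/data/datacomp2023/evaluate_datasets/hf_cache",
)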

If that doesn't work, could you check to make sure the files are indeed in the hf_cache folder?
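
A quick sketch for checking that (the path is illustrative again):

import os

cache_dir = "/mnt/data/datacomp2023/evaluate_datasets/hf_cache"  # illustrative

# Walk the cache and print every file, to verify the retrieval
# datasets were actually copied over from the online machine.
for root, _, files in os.walk(cache_dir):
    for name in files:
        print(os.path.join(root, name))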

Sorry for the late reply, but I was able to bypass this issue by modifying the datacomp source code as follows:

diff --git a/eval_utils/retr_eval.py b/eval_utils/retr_eval.py
index 3c19917..647edf7 100644
--- a/eval_utils/retr_eval.py
+++ b/eval_utils/retr_eval.py
@@ -37,7 +37,7 @@ def evaluate_retrieval_dataset(
 
     dataset = RetrievalDataset(
         datasets.load_dataset(
-            f"nlphuji/{task.replace('retrieval/', '')}",
+            f"/mnt/data/datacomp2023/evaluate_datasets/{task.replace('retrieval/', '')}.py",
             split="test",
             cache_dir=os.path.join(data_root, "hf_cache")
             if data_root is not None

which forces HF to use my local dataset repository instead of checking for any online updates.
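
For reference, the same idea works standalone: passing load_dataset a local dataset script instead of a Hub repo id avoids the Hub lookup entirely. A minimal sketch, assuming the script was copied over with the rest of the data (path illustrative):

import datasets

# A local path to a dataset script makes load_dataset resolve
# everything on disk, so no connection to the Hub is attempted.
ds = datasets.load_dataset(
    "/mnt/data/datacomp2023/evaluate_datasets/flickr_1k_test_image_text_retrieval.py",
    split="test",
)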