Checksum and FileNotFound errors when trying to evaluate pretrained models

catherinening opened this issue 7 months ago · comments

🐛 Bug

I am trying to replicate the Hateful Memes baselines and have tried two different commands. Each one fails, either with a checksum error or with a FileNotFoundError. Both commands and their errors are reproduced below. In the FileNotFoundError case, the cache directory contains a file with the same name and a .tar.gz extension, but no file with the .tar.gz.part extension that the code is looking for. If I go to the cache directory and rename that file so it ends in .tar.gz.part, I then get a checksum error similar to the one shown under step 1 below.
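
A quick way to inspect what is actually sitting in the cache directory is sketched below. It is only a diagnostic aid and rests on an assumption: that the checksum being compared is a SHA-256 digest of the downloaded archive, which the report does not state. The helper names `sha256_of` and `describe_cache_dir` are hypothetical and are not part of mmf.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_cache_dir(cache_dir):
    """Map each file name in cache_dir to (size in bytes, SHA-256 digest)."""
    report = {}
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_file():
            report[entry.name] = (entry.stat().st_size, sha256_of(entry))
    return report
```

Calling `describe_cache_dir` on the model directory named in the traceback lists every cached file with its size and digest. A file that is much smaller than the published archive would point to an interrupted or overlapping download rather than a wrong expected checksum.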

To Reproduce

Steps to reproduce the behavior:

1. Running this sample command from the mmf checkpointing page (https://mmf.sh/docs/tutorials/checkpointing):

mmf_run config=projects/visual_bert/configs/hateful_memes/from_coco.yaml \
    model=visual_bert \
    dataset=hateful_memes \
    run_type=train_val

Error:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 52, in main
    trainer.load()
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/mmf_trainer.py", line 46, in load
    self.on_init_start()
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/core/callback_hook.py", line 20, in on_init_start
    callback.on_init_start(**kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/callbacks/checkpoint.py", line 30, in on_init_start
    self._checkpoint.load_state_dict()
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 243, in load_state_dict
    load_pretrained=ckpt_config.resume_pretrained,
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 270, in _load
    ckpt, should_continue = self._load_from_zoo(file)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 453, in _load_from_zoo
    zoo_ckpt = load_pretrained_model(file)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 162, in load_pretrained_model
    return _load_pretrained_model(model_name_or_path_or_checkpoint, args, kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 130, in _load_pretrained_model
    download_path = download_pretrained_model(model_name_or_path, *args, **kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 378, in download_pretrained_model
    download_resources(resources, download_path, version)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 397, in download_resources
    download_resource(resource, download_path)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 406, in download_resource
    resource.download_file(download_path)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 161, in download_file
    self.checksum(download_path)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 127, in checksum
    f"[ Checksum for {self._file_name} from \n{self._url}\n"
AssertionError: [ Checksum for visual_bert.pretrained.coco_train_val.tar.gz from 
https://dl.fbaipublicfiles.com/mmf/data/models/visual_bert/visual_bert.pretrained.coco_train_val.tar.gz
does not match the expected checksum. Please try again. ]

2. Running the command below (based on the template from the mmf Hateful Memes repo) to reproduce the Visual BERT COCO baseline:

mmf_run config=projects/hateful_memes/configs/visual_bert/from_coco.yaml \
  model=visual_bert dataset=hateful_memes \
  run_type=val checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco \
  checkpoint.resume_pretrained=False

Error:

- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 566, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz.part' -> '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 52, in main
    trainer.load()
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/mmf_trainer.py", line 46, in load
    self.on_init_start()
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/core/callback_hook.py", line 20, in on_init_start
    callback.on_init_start(**kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/callbacks/checkpoint.py", line 30, in on_init_start
    self._checkpoint.load_state_dict()
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 243, in load_state_dict
    load_pretrained=ckpt_config.resume_pretrained,
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 270, in _load
    ckpt, should_continue = self._load_from_zoo(file)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 453, in _load_from_zoo
    zoo_ckpt = load_pretrained_model(file)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 162, in load_pretrained_model
    return _load_pretrained_model(model_name_or_path_or_checkpoint, args, kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 130, in _load_pretrained_model
    download_path = download_pretrained_model(model_name_or_path, *args, **kwargs)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 378, in download_pretrained_model
    download_resources(resources, download_path, version)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 397, in download_resources
    download_resource(resource, download_path)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 406, in download_resource
    resource.download_file(download_path)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 156, in download_file
    self._url, download_path, self._file_name, redownload=redownload
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 303, in download
    move(resume_file, outfile)
  File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 422, in move
    shutil.move(path1, path2)
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 580, in move
    copy_function(src, real_dst)
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 266, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz.part'
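
One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the error is raised inside a spawned worker (`Process 0`), so more than one process may be downloading into the same cache path. If one worker renames the `.part` file first, any other worker that attempts the same rename finds it missing, which would also explain why a `.tar.gz` file exists in the cache while the `.tar.gz.part` file does not. The sketch below replays that sequence in a single process. The file name and both helper functions are hypothetical and do not come from mmf.

```python
import os
import shutil
import tempfile


def finish_download(part_path, final_path):
    """Rename the temporary .part file to its final name."""
    shutil.move(part_path, final_path)


def simulate_two_workers():
    """Replay, in sequence, what two workers sharing one cache path could do.

    Returns (final_exists, part_exists) as seen by the second worker
    at the moment its rename fails.
    """
    with tempfile.TemporaryDirectory() as cache_dir:
        final_path = os.path.join(cache_dir, "model.tar.gz")  # hypothetical name
        part_path = final_path + ".part"

        # Both workers stream into the same .part file (a single write here).
        with open(part_path, "wb") as handle:
            handle.write(b"archive bytes")

        # Worker A finishes first and renames the .part file.
        finish_download(part_path, final_path)

        # Worker B attempts the same rename, but the .part file is gone.
        try:
            finish_download(part_path, final_path)
        except FileNotFoundError:
            return os.path.exists(final_path), os.path.exists(part_path)
    return None
```

In this replay the second call fails with the same exception type as in the traceback, leaving the final archive in place and no `.part` file. Overlapping writes to one `.part` file could likewise leave a corrupted archive behind, which would be consistent with the checksum error, though the report does not confirm either mechanism.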

Environment

The output from the following script is below:

python -m torch.utils.collect_env

PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.7.16 (default, Jan 17 2023, 22:20:44)  [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.6.0
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[conda] numpy                     1.21.4                   pypi_0    pypi
[conda] pytorch-lightning         1.6.0                    pypi_0    pypi
[conda] torch                     1.11.0                   pypi_0    pypi
[conda] torchaudio                0.11.0                   pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtext                 0.12.0                   pypi_0    pypi
[conda] torchvision               0.12.0                   pypi_0    pypi

Additional context

I set MMF_USER_DIR to the mmf/mmf directory on my local machine.

catherinening · Answer 1 · Tue Nov 28 2023 05:46:06 GMT+0800 (China Standard Time)

I have subsequently realized that both of these errors stem from training on more than one GPU. It seems that mmf models can only be trained on a single GPU.
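
For anyone who wants to try the single-GPU route on a multi-GPU machine, here is a minimal sketch. It relies on the standard `CUDA_VISIBLE_DEVICES` environment variable to expose only one device to the process, and it assumes (not verified here) that mmf chooses the number of worker processes from the GPUs it can see. The command arguments are copied from the second command in the report; `single_gpu_env` and `run_on_single_gpu` are hypothetical helper names.

```python
import os
import subprocess


def single_gpu_env(base_env=None, gpu_index=0):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Arguments copied from the second command in the report.
MMF_COMMAND = [
    "mmf_run",
    "config=projects/hateful_memes/configs/visual_bert/from_coco.yaml",
    "model=visual_bert",
    "dataset=hateful_memes",
    "run_type=val",
    "checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco",
    "checkpoint.resume_pretrained=False",
]


def run_on_single_gpu(gpu_index=0):
    """Launch mmf_run with only one GPU visible; raises if the command fails."""
    return subprocess.run(MMF_COMMAND, env=single_gpu_env(gpu_index=gpu_index), check=True)
```

Calling `run_on_single_gpu()` starts the evaluation with only GPU 0 visible. If an earlier multi-GPU attempt left a partial or corrupted archive in the cache directory, it may also be necessary to delete that directory first so the file is downloaded again.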