Checksum and FileNotFound errors when trying to evaluate pretrained models
catherinening opened this issue
🐛 Bug
I am trying to replicate the Hateful Memes baselines and tried two different commands. Each one fails, with either a checksum error or a FileNotFoundError. The commands and errors are reproduced below. In the FileNotFoundError case, the cache directory contains a file with the expected name and a .tar.gz extension, but no file with the .tar.gz.part extension that the code is looking for. If I go to the cache directory and rename the file so that it ends in .tar.gz.part, I get a checksum error instead, similar to the one produced by the first command below.
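To make the cache state described above easier to report, a small script can list the archives and partial downloads in a cache directory together with their sizes and digests. This is only a sketch: `sha256_of` and `describe_cache` are hypothetical helper names, and it assumes the expected checksum is a SHA-256 digest, which should be confirmed against mmf/utils/download.py.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_cache(cache_dir):
    """Map each .tar.gz or .tar.gz.part file in cache_dir to (size, sha256)."""
    report = {}
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_file() and entry.name.endswith((".tar.gz", ".tar.gz.part")):
            report[entry.name] = (entry.stat().st_size, sha256_of(entry))
    return report
```

Calling `describe_cache` on the model directory under .cache/torch/mmf/data/models shows whether a complete archive, a leftover .part file, or both are present, and the digest can be compared by hand with the value the downloader expects.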
Command
To Reproduce
Steps to reproduce the behavior:
1. Running this sample command from the MMF checkpointing page (https://mmf.sh/docs/tutorials/checkpointing):
mmf_run config=projects/visual_bert/configs/hateful_memes/from_coco.yaml \
model=visual_bert \
dataset=hateful_memes \
run_type=train_val
Error:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 66, in distributed_main
main(configuration, init_distributed=True, predict=predict)
File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 52, in main
trainer.load()
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/mmf_trainer.py", line 46, in load
self.on_init_start()
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/core/callback_hook.py", line 20, in on_init_start
callback.on_init_start(**kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/callbacks/checkpoint.py", line 30, in on_init_start
self._checkpoint.load_state_dict()
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 243, in load_state_dict
load_pretrained=ckpt_config.resume_pretrained,
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 270, in _load
ckpt, should_continue = self._load_from_zoo(file)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 453, in _load_from_zoo
zoo_ckpt = load_pretrained_model(file)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 162, in load_pretrained_model
return _load_pretrained_model(model_name_or_path_or_checkpoint, args, kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 130, in _load_pretrained_model
download_path = download_pretrained_model(model_name_or_path, *args, **kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 378, in download_pretrained_model
download_resources(resources, download_path, version)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 397, in download_resources
download_resource(resource, download_path)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 406, in download_resource
resource.download_file(download_path)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 161, in download_file
self.checksum(download_path)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 127, in checksum
f"[ Checksum for {self._file_name} from \n{self._url}\n"
AssertionError: [ Checksum for visual_bert.pretrained.coco_train_val.tar.gz from
https://dl.fbaipublicfiles.com/mmf/data/models/visual_bert/visual_bert.pretrained.coco_train_val.tar.gz
does not match the expected checksum. Please try again. ]
2. Running the command below (based on the template from the MMF Hateful Memes repo) to reproduce the VisualBERT COCO baseline:
mmf_run config=projects/hateful_memes/configs/visual_bert/from_coco.yaml \
model=visual_bert dataset=hateful_memes \
run_type=val checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco \
checkpoint.resume_pretrained=False
Error:
- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 566, in move
os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz.part' -> '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 66, in distributed_main
main(configuration, init_distributed=True, predict=predict)
File "/storage/ice1/4/1/cning8/mmf/mmf_cli/run.py", line 52, in main
trainer.load()
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/mmf_trainer.py", line 46, in load
self.on_init_start()
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/core/callback_hook.py", line 20, in on_init_start
callback.on_init_start(**kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/trainers/callbacks/checkpoint.py", line 30, in on_init_start
self._checkpoint.load_state_dict()
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 243, in load_state_dict
load_pretrained=ckpt_config.resume_pretrained,
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 270, in _load
ckpt, should_continue = self._load_from_zoo(file)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 453, in _load_from_zoo
zoo_ckpt = load_pretrained_model(file)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 162, in load_pretrained_model
return _load_pretrained_model(model_name_or_path_or_checkpoint, args, kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/checkpoint.py", line 130, in _load_pretrained_model
download_path = download_pretrained_model(model_name_or_path, *args, **kwargs)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 378, in download_pretrained_model
download_resources(resources, download_path, version)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 397, in download_resources
download_resource(resource, download_path)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 406, in download_resource
resource.download_file(download_path)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 156, in download_file
self._url, download_path, self._file_name, redownload=redownload
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 303, in download
move(resume_file, outfile)
File "/storage/ice1/4/1/cning8/mmf/mmf/utils/download.py", line 422, in move
shutil.move(path1, path2)
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 580, in move
copy_function(src, real_dst)
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 266, in copy2
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/storage/ice1/4/1/cning8/conda-env/lib/python3.7/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/storage/ice1/4/1/cning8/.cache/torch/mmf/data/models/visual_bert.finetuned.hateful_memes.from_coco/visual_bert.finetuned.hateful_memes_from_coco.tar.gz.part'
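The trace above fails inside `shutil.move` because the .tar.gz.part source no longer exists, while a .tar.gz file with the same base name is already in the cache. One plausible reading, not confirmed against the MMF code, is that more than one spawned process ran the same download-and-rename step on a shared cache directory. The sketch below reproduces only the symptom, sequentially and with hypothetical file names; it does not use MMF.

```python
import shutil
import tempfile
from pathlib import Path


def finalize_download(part_path, final_path):
    """Rename a finished partial download to its final name."""
    shutil.move(str(part_path), str(final_path))


def simulate_two_workers():
    """Run the finalize step twice on one shared file, as two workers might."""
    with tempfile.TemporaryDirectory() as cache_dir:
        part = Path(cache_dir) / "model.tar.gz.part"  # hypothetical file name
        final = Path(cache_dir) / "model.tar.gz"
        part.write_bytes(b"archive bytes")

        finalize_download(part, final)  # first worker: rename succeeds
        try:
            finalize_download(part, final)  # second worker: source is gone
        except FileNotFoundError as error:
            return final.exists(), part.exists(), type(error).__name__
        return final.exists(), part.exists(), None
```

After the second call fails, the directory holds the .tar.gz file but no .tar.gz.part file, which matches the cache state described at the top of this report.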
Environment
The output from the following script is below:
python -m torch.utils.collect_env
PyTorch version: 1.11.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux Server release 7.9 (Maipo) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17
Python version: 3.7.16 (default, Jan 17 2023, 22:20:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.49.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.4
[pip3] pytorch-lightning==1.6.0
[pip3] torch==1.11.0
[pip3] torchaudio==0.11.0
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[conda] numpy 1.21.4 pypi_0 pypi
[conda] pytorch-lightning 1.6.0 pypi_0 pypi
[conda] torch 1.11.0 pypi_0 pypi
[conda] torchaudio 0.11.0 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.12.0 pypi_0 pypi
[conda] torchvision 0.12.0 pypi_0 pypi
Additional context
I set MMF_USER_DIR to the mmf/mmf directory on my local machine.
I have subsequently realized that both of these errors stem from training on more than one GPU. It seems that mmf models can only be trained on a single GPU.
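For anyone who wants to try the single-GPU route on a multi-GPU node, one option is to hide all but one device through the standard `CUDA_VISIBLE_DEVICES` environment variable before launching the run. The sketch below is untested against MMF and assumes MMF decides whether to spawn one worker per GPU from the number of visible devices; `single_gpu_env` and `visual_bert_eval_command` are hypothetical helper names, and the arguments are copied from the second command in this report.

```python
import os


def single_gpu_env(base_env=None, device="0"):
    """Return a copy of the environment that exposes only one CUDA device."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = device
    return env


def visual_bert_eval_command():
    """The second command from this report, split into an argument list."""
    return [
        "mmf_run",
        "config=projects/hateful_memes/configs/visual_bert/from_coco.yaml",
        "model=visual_bert",
        "dataset=hateful_memes",
        "run_type=val",
        "checkpoint.resume_zoo=visual_bert.finetuned.hateful_memes.from_coco",
        "checkpoint.resume_pretrained=False",
    ]
```

The run can then be started with `subprocess.run(visual_bert_eval_command(), env=single_gpu_env(), check=True)`. If a .part file or a partially written archive is left in the cache from an earlier multi-GPU attempt, it may need to be removed first so the download starts cleanly.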