xiuqhou / Salience-DETR

[CVPR 2024] Official implementation of the paper "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement"

Home Page: https://arxiv.org/abs/2403.16131

raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

z972778371 opened this issue

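The first time I ran accelerate main.py, the program got as far as downloading the pretrained resnet50 model, but the download did not finish, and it then failed with RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
This may have been a network problem. However, after I closed the terminal and tried to run it again, the program no longer downloads the file and instead fails with this error: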

[2024-05-08 08:47:13 det.models.backbones.base_backbone]: Backbone architecture: resnet50
Loading extension module MultiScaleDeformableAttention...
Traceback (most recent call last):
  File "main.py", line 205, in <module>
    train()
  File "main.py", line 124, in train
    model = Config(cfg.model_path).model
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in __init__
    exec(code, name_space)
  File "<string>", line 34, in <module>
  File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in __new__
    weights = load_checkpoint(default_weight if weights is None else weights)
  File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
    return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
    return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[The identical traceback is repeated three more times in the output.]
[2024-05-08 08:47:31,260] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5937 closing signal SIGTERM
[2024-05-08 08:47:31,261] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5938 closing signal SIGTERM
[2024-05-08 08:47:31,877] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 5939) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2024-05-08_08:47:31
host : ubuntu-X640-G30
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 5940)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-08_08:47:31
host : ubuntu-X640-G30
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 5939)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

What should I do now?
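
The traceback above fails inside torch.load(cached_file, ...), called from torch.hub.load_state_dict_from_url. That suggests the interrupted first run left a partial file in the torch hub cache and that later runs load it instead of downloading again. This is an inference from the log, not a confirmed diagnosis. The sketch below, using only the Python standard library, looks for such files: ones that begin with the zip signature but have no end-of-central-directory record. The function names are hypothetical and are not part of Salience-DETR or PyTorch. The cache location (~/.cache/torch/hub/checkpoints) is PyTorch's default and will differ if TORCH_HOME or XDG_CACHE_HOME is set. The check is a heuristic, so review the list before deleting anything.

```python
import zipfile
from pathlib import Path
from typing import List

# Signature of the local file header that starts a zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def find_truncated_checkpoints(cache_dir: Path) -> List[Path]:
    """Return *.pth files that start like a zip archive but look cut off.

    Hypothetical helper: files that do not start with the zip signature
    (for example legacy, non-zip checkpoints) are deliberately ignored.
    """
    truncated = []
    for path in sorted(cache_dir.glob("*.pth")):
        with path.open("rb") as fh:
            starts_like_zip = fh.read(4) == ZIP_MAGIC
        # zipfile.is_zipfile searches for the end-of-central-directory record,
        # which sits at the end of the archive and is missing after a
        # download that stopped part-way.
        if starts_like_zip and not zipfile.is_zipfile(path):
            truncated.append(path)
    return truncated


def remove_truncated_checkpoints(cache_dir: Path) -> List[Path]:
    """Delete the files reported by find_truncated_checkpoints and return them."""
    removed = find_truncated_checkpoints(cache_dir)
    for path in removed:
        path.unlink()
    return removed
```

Calling remove_truncated_checkpoints(Path.home() / ".cache" / "torch" / "hub" / "checkpoints") and then rerunning the training command should cause the weights to be downloaded again, provided the assumption about the partial file holds. Deleting the partial resnet50 file from that directory by hand has the same effect.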