Hello everyone, sorry to interrupt. I'm encountering a problem and I'm not sure why it's happening. I've already generated 'gt_png', 'label_png', 'train.txt', and 'val.txt' in the dataset, but I still get an error when I run the following command:

torchrun --nproc_per_node=2 tools/train_amp.py --finetune-from /home/xyh/.cache/torch/hub/checkpoints/backbone_v2.pth --config ./configs/bisenetv2_city.py

Could someone please help me understand why this is happening?

Question

HorizonXYH opened this issue a year ago

WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

load pretrained weights from /home/xyh/.cache/torch/hub/checkpoints/backbone_v2.pth
missing keys: ["detail.S1.0.conv.weight", "detail.S1.0.bn.weight", "detail.S1.0.bn.bias", "detail.S1.0.bn.running_mean", "detail.S1.0.bn.running_var", "detail.S1.1.conv.weight", "detail.S1.1.bn.weight", "detail.S1.1.bn.bias", "detail.S1.1.bn.running_mean", "detail.S1.1.bn.running_var", "detail.S2.0.conv.weight", "detail.S2.0.bn.weight", "detail.S2.0.bn.bias", "detail.S2.0.bn.running_mean", "detail.S2.0.bn.running_var", "detail.S2.1.conv.weight", "detail.S2.1.bn.weight", "detail.S2.1.bn.bias", "detail.S2.1.bn.running_mean", "detail.S2.1.bn.running_var", "detail.S2.2.conv.weight", "detail.S2.2.bn.weight", "detail.S2.2.bn.bias", "detail.S2.2.bn.running_mean", "detail.S2.2.bn.running_var", "detail.S3.0.conv.weight", "detail.S3.0.bn.weight", "detail.S3.0.bn.bias", "detail.S3.0.bn.running_mean", "detail.S3.0.bn.running_var", "detail.S3.1.conv.weight", "detail.S3.1.bn.weight", "detail.S3.1.bn.bias", "detail.S3.1.bn.running_mean", "detail.S3.1.bn.running_var", "detail.S3.2.conv.weight", "detail.S3.2.bn.weight", "detail.S3.2.bn.bias", "detail.S3.2.bn.running_mean", "detail.S3.2.bn.running_var", "segment.S1S2.conv.conv.weight", "segment.S1S2.conv.bn.weight", "segment.S1S2.conv.bn.bias", "segment.S1S2.conv.bn.running_mean", "segment.S1S2.conv.bn.running_var", "segment.S1S2.left.0.conv.weight", "segment.S1S2.left.0.bn.weight", "segment.S1S2.left.0.bn.bias", "segment.S1S2.left.0.bn.running_mean", "segment.S1S2.left.0.bn.running_var", "segment.S1S2.left.1.conv.weight", "segment.S1S2.left.1.bn.weight", "segment.S1S2.left.1.bn.bias", "segment.S1S2.left.1.bn.running_mean", "segment.S1S2.left.1.bn.running_var", "segment.S1S2.fuse.conv.weight", "segment.S1S2.fuse.bn.weight", "segment.S1S2.fuse.bn.bias", "segment.S1S2.fuse.bn.running_mean", "segment.S1S2.fuse.bn.running_var", "segment.S3.0.conv1.conv.weight", "segment.S3.0.conv1.bn.weight", "segment.S3.0.conv1.bn.bias", "segment.S3.0.conv1.bn.running_mean", "segment.S3.0.conv1.bn.running_var", "segment.S3.0.dwconv1.0.weight", 
"segment.S3.0.dwconv1.1.weight", "segment.S3.0.dwconv1.1.bias", "segment.S3.0.dwconv1.1.running_mean", "segment.S3.0.dwconv1.1.running_var", "segment.S3.0.dwconv2.0.weight", "segment.S3.0.dwconv2.1.weight", "segment.S3.0.dwconv2.1.bias", "segment.S3.0.dwconv2.1.running_mean", "segment.S3.0.dwconv2.1.running_var", "segment.S3.0.conv2.0.weight", "segment.S3.0.conv2.1.weight", "segment.S3.0.conv2.1.bias", "segment.S3.0.conv2.1.running_mean", "segment.S3.0.conv2.1.running_var", "segment.S3.0.shortcut.0.weight", "segment.S3.0.shortcut.1.weight", "segment.S3.0.shortcut.1.bias", "segment.S3.0.shortcut.1.running_mean", "segment.S3.0.shortcut.1.running_var", "segment.S3.0.shortcut.2.weight", "segment.S3.0.shortcut.3.weight", "segment.S3.0.shortcut.3.bias", "segment.S3.0.shortcut.3.running_mean", "segment.S3.0.shortcut.3.running_var", "segment.S3.1.conv1.conv.weight", "segment.S3.1.conv1.bn.weight", "segment.S3.1.conv1.bn.bias", "segment.S3.1.conv1.bn.running_mean", "segment.S3.1.conv1.bn.running_var", "segment.S3.1.dwconv.0.weight", "segment.S3.1.dwconv.1.weight", "segment.S3.1.dwconv.1.bias", "segment.S3.1.dwconv.1.running_mean", "segment.S3.1.dwconv.1.running_var", "segment.S3.1.conv2.0.weight", "segment.S3.1.conv2.1.weight", "segment.S3.1.conv2.1.bias", "segment.S3.1.conv2.1.running_mean", "segment.S3.1.conv2.1.running_var", "segment.S4.0.conv1.conv.weight", "segment.S4.0.conv1.bn.weight", "segment.S4.0.conv1.bn.bias", "segment.S4.0.conv1.bn.running_mean", "segment.S4.0.conv1.bn.running_var", "segment.S4.0.dwconv1.0.weight", "segment.S4.0.dwconv1.1.weight", "segment.S4.0.dwconv1.1.bias", "segment.S4.0.dwconv1.1.running_mean", "segment.S4.0.dwconv1.1.running_var", "segment.S4.0.dwconv2.0.weight", "segment.S4.0.dwconv2.1.weight", "segment.S4.0.dwconv2.1.bias", "segment.S4.0.dwconv2.1.running_mean", "segment.S4.0.dwconv2.1.running_var", "segment.S4.0.conv2.0.weight", "segment.S4.0.conv2.1.weight", "segment.S4.0.conv2.1.bias", "segment.S4.0.conv2.1.running_mean", 
"segment.S4.0.conv2.1.running_var", "segment.S4.0.shortcut.0.weight", "segment.S4.0.shortcut.1.weight", "segment.S4.0.shortcut.1.bias", "segment.S4.0.shortcut.1.running_mean", "segment.S4.0.shortcut.1.running_var", "segment.S4.0.shortcut.2.weight", "segment.S4.0.shortcut.3.weight", "segment.S4.0.shortcut.3.bias", "segment.S4.0.shortcut.3.running_mean", "segment.S4.0.shortcut.3.running_var", "segment.S4.1.conv1.conv.weight", "segment.S4.1.conv1.bn.weight", "segment.S4.1.conv1.bn.bias", "segment.S4.1.conv1.bn.running_mean", "segment.S4.1.conv1.bn.running_var", "segment.S4.1.dwconv.0.weight", "segment.S4.1.dwconv.1.weight", "segment.S4.1.dwconv.1.bias", "segment.S4.1.dwconv.1.running_mean", "segment.S4.1.dwconv.1.running_var", "segment.S4.1.conv2.0.weight", "segment.S4.1.conv2.1.weight", "segment.S4.1.conv2.1.bias", "segment.S4.1.conv2.1.running_mean", "segment.S4.1.conv2.1.running_var", "segment.S5_4.0.conv1.conv.weight", "segment.S5_4.0.conv1.bn.weight", "segment.S5_4.0.conv1.bn.bias", "segment.S5_4.0.conv1.bn.running_mean", "segment.S5_4.0.conv1.bn.running_var", "segment.S5_4.0.dwconv1.0.weight", "segment.S5_4.0.dwconv1.1.weight", "segment.S5_4.0.dwconv1.1.bias", "segment.S5_4.0.dwconv1.1.running_mean", "segment.S5_4.0.dwconv1.1.running_var", "segment.S5_4.0.dwconv2.0.weight", "segment.S5_4.0.dwconv2.1.weight", "segment.S5_4.0.dwconv2.1.bias", "segment.S5_4.0.dwconv2.1.running_mean", "segment.S5_4.0.dwconv2.1.running_var", "segment.S5_4.0.conv2.0.weight", "segment.S5_4.0.conv2.1.weight", "segment.S5_4.0.conv2.1.bias", "segment.S5_4.0.conv2.1.running_mean", "segment.S5_4.0.conv2.1.running_var", "segment.S5_4.0.shortcut.0.weight", "segment.S5_4.0.shortcut.1.weight", "segment.S5_4.0.shortcut.1.bias", "segment.S5_4.0.shortcut.1.running_mean", "segment.S5_4.0.shortcut.1.running_var", "segment.S5_4.0.shortcut.2.weight", "segment.S5_4.0.shortcut.3.weight", "segment.S5_4.0.shortcut.3.bias", "segment.S5_4.0.shortcut.3.running_mean", "segment.S5_4.0.shortcut.3.running_var", 
"segment.S5_4.1.conv1.conv.weight", "segment.S5_4.1.conv1.bn.weight", "segment.S5_4.1.conv1.bn.bias", "segment.S5_4.1.conv1.bn.running_mean", "segment.S5_4.1.conv1.bn.running_var", "segment.S5_4.1.dwconv.0.weight", "segment.S5_4.1.dwconv.1.weight", "segment.S5_4.1.dwconv.1.bias", "segment.S5_4.1.dwconv.1.running_mean", "segment.S5_4.1.dwconv.1.running_var", "segment.S5_4.1.conv2.0.weight", "segment.S5_4.1.conv2.1.weight", "segment.S5_4.1.conv2.1.bias", "segment.S5_4.1.conv2.1.running_mean", "segment.S5_4.1.conv2.1.running_var", "segment.S5_4.2.conv1.conv.weight", "segment.S5_4.2.conv1.bn.weight", "segment.S5_4.2.conv1.bn.bias", "segment.S5_4.2.conv1.bn.running_mean", "segment.S5_4.2.conv1.bn.running_var", "segment.S5_4.2.dwconv.0.weight", "segment.S5_4.2.dwconv.1.weight", "segment.S5_4.2.dwconv.1.bias", "segment.S5_4.2.dwconv.1.running_mean", "segment.S5_4.2.dwconv.1.running_var", "segment.S5_4.2.conv2.0.weight", "segment.S5_4.2.conv2.1.weight", "segment.S5_4.2.conv2.1.bias", "segment.S5_4.2.conv2.1.running_mean", "segment.S5_4.2.conv2.1.running_var", "segment.S5_4.3.conv1.conv.weight", "segment.S5_4.3.conv1.bn.weight", "segment.S5_4.3.conv1.bn.bias", "segment.S5_4.3.conv1.bn.running_mean", "segment.S5_4.3.conv1.bn.running_var", "segment.S5_4.3.dwconv.0.weight", "segment.S5_4.3.dwconv.1.weight", "segment.S5_4.3.dwconv.1.bias", "segment.S5_4.3.dwconv.1.running_mean", "segment.S5_4.3.dwconv.1.running_var", "segment.S5_4.3.conv2.0.weight", "segment.S5_4.3.conv2.1.weight", "segment.S5_4.3.conv2.1.bias", "segment.S5_4.3.conv2.1.running_mean", "segment.S5_4.3.conv2.1.running_var", "segment.S5_5.bn.weight", "segment.S5_5.bn.bias", "segment.S5_5.bn.running_mean", "segment.S5_5.bn.running_var", "segment.S5_5.conv_gap.conv.weight", "segment.S5_5.conv_gap.bn.weight", "segment.S5_5.conv_gap.bn.bias", "segment.S5_5.conv_gap.bn.running_mean", "segment.S5_5.conv_gap.bn.running_var", "segment.S5_5.conv_last.conv.weight", "segment.S5_5.conv_last.bn.weight", 
"segment.S5_5.conv_last.bn.bias", "segment.S5_5.conv_last.bn.running_mean", "segment.S5_5.conv_last.bn.running_var", "bga.left1.0.weight", "bga.left1.1.weight", "bga.left1.1.bias", "bga.left1.1.running_mean", "bga.left1.1.running_var", "bga.left1.2.weight", "bga.left2.0.weight", "bga.left2.1.weight", "bga.left2.1.bias", "bga.left2.1.running_mean", "bga.left2.1.running_var", "bga.right1.0.weight", "bga.right1.1.weight", "bga.right1.1.bias", "bga.right1.1.running_mean", "bga.right1.1.running_var", "bga.right2.0.weight", "bga.right2.1.weight", "bga.right2.1.bias", "bga.right2.1.running_mean", "bga.right2.1.running_var", "bga.right2.2.weight", "bga.conv.0.weight", "bga.conv.1.weight", "bga.conv.1.bias", "bga.conv.1.running_mean", "bga.conv.1.running_var", "head.conv.conv.weight", "head.conv.bn.weight", "head.conv.bn.bias", "head.conv.bn.running_mean", "head.conv.bn.running_var", "head.conv_out.1.weight", "head.conv_out.1.bias", "aux2.conv.conv.weight", "aux2.conv.bn.weight", "aux2.conv.bn.bias", "aux2.conv.bn.running_mean", "aux2.conv.bn.running_var", "aux2.conv_out.0.1.conv.weight", "aux2.conv_out.0.1.bn.weight", "aux2.conv_out.0.1.bn.bias", "aux2.conv_out.0.1.bn.running_mean", "aux2.conv_out.0.1.bn.running_var", "aux2.conv_out.1.weight", "aux2.conv_out.1.bias", "aux3.conv.conv.weight", "aux3.conv.bn.weight", "aux3.conv.bn.bias", "aux3.conv.bn.running_mean", "aux3.conv.bn.running_var", "aux3.conv_out.0.1.conv.weight", "aux3.conv_out.0.1.bn.weight", "aux3.conv_out.0.1.bn.bias", "aux3.conv_out.0.1.bn.running_mean", "aux3.conv_out.0.1.bn.running_var", "aux3.conv_out.1.weight", "aux3.conv_out.1.bias", "aux4.conv.conv.weight", "aux4.conv.bn.weight", "aux4.conv.bn.bias", "aux4.conv.bn.running_mean", "aux4.conv.bn.running_var", "aux4.conv_out.0.1.conv.weight", "aux4.conv_out.0.1.bn.weight", "aux4.conv_out.0.1.bn.bias", "aux4.conv_out.0.1.bn.running_mean", "aux4.conv_out.0.1.bn.running_var", "aux4.conv_out.1.weight", "aux4.conv_out.1.bias", "aux5_4.conv.conv.weight", 
"aux5_4.conv.bn.weight", "aux5_4.conv.bn.bias", "aux5_4.conv.bn.running_mean", "aux5_4.conv.bn.running_var", "aux5_4.conv_out.0.1.conv.weight", "aux5_4.conv_out.0.1.bn.weight", "aux5_4.conv_out.0.1.bn.bias", "aux5_4.conv_out.0.1.bn.running_mean", "aux5_4.conv_out.0.1.bn.running_var", "aux5_4.conv_out.1.weight", "aux5_4.conv_out.1.bias"]
unexpected keys: []
Traceback (most recent call last):
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 210, in <module>
main()
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 206, in main
train()
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 159, in train
logits, logits_aux = net(im)
^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/BiSeNet-master/lib/models/bisenetv2.py", line 335, in forward
feat_head = self.bga(feat_d, feat_s)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/BiSeNet-master/lib/models/bisenetv2.py", line 277, in forward
left = left1 * torch.sigmoid(right1)
^
RuntimeError: The size of tensor a (50) must match the size of tensor b (52) at non-singleton dimension 3
Traceback (most recent call last):
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 210, in <module>
main()
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 206, in main
train()
File "/home/xyh/BiSeNet-master/tools/train_amp.py", line 159, in train
logits, logits_aux = net(im)
^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/BiSeNet-master/lib/models/bisenetv2.py", line 335, in forward
feat_head = self.bga(feat_d, feat_s)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/BiSeNet-master/lib/models/bisenetv2.py", line 277, in forward
left = left1 * torch.sigmoid(right1)
^~~~~~~~~
RuntimeError: The size of tensor a (50) must match the size of tensor b (52) at non-singleton dimension 3
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 4017214) of binary: /home/xyh/anaconda3/envs/pytorch/bin/python
Traceback (most recent call last):
File "/home/xyh/anaconda3/envs/pytorch/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xyh/anaconda3/envs/pytorch/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train_amp.py FAILED

Failures:
[1]:
time : 2023-06-29_21:01:15
host : xmlg-PR4910W
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 4017215)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2023-06-29_21:01:15
host : xmlg-PR4910W
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 4017214)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

CoinCheung · Answer 1 · Mon Aug 07 2023 18:33:48 GMT+0800 (China Standard Time)

There seem to be two problems:

1. You should not use --finetune-from to load a backbone checkpoint; it is meant for a whole-model checkpoint.
2. Your image sizes are supposed to be divisible by 32, such as 768x512.
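To illustrate the image-size point, here is a rough sketch of why a size that is not a multiple of 32 can produce the "50 vs 52" mismatch in the traceback. It assumes (this is not confirmed in the thread) that every stride-2 layer maps a length n to ceil(n / 2), that the detail branch downsamples by 8, and that the semantic branch downsamples by 32 and is then upsampled by 4 before the two are multiplied. The helper names `halve`, `branch_sizes`, and `round_up` are made up for this example and are not part of the repository.

```python
import math
from typing import Tuple


def halve(n: int, times: int) -> int:
    """Apply ceil(n / 2) `times` times (assumed behaviour of a stride-2 layer)."""
    for _ in range(times):
        n = math.ceil(n / 2)
    return n


def branch_sizes(n: int) -> Tuple[int, int]:
    """Return (detail, semantic) feature lengths for an input length n.

    Assumption: detail branch is at 1/8 resolution; semantic branch is at
    1/32 resolution and is upsampled by a factor of 4 before fusion.
    """
    detail = halve(n, 3)
    semantic = halve(n, 5) * 4
    return detail, semantic


def round_up(n: int, multiple: int = 32) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return math.ceil(n / multiple) * multiple


if __name__ == "__main__":
    for width in (400, 416, 768):
        detail, semantic = branch_sizes(width)
        status = "ok" if detail == semantic else "mismatch"
        print(width, detail, semantic, status, "-> use", round_up(width))
```

Under these assumptions, a width such as 400 gives 50 on one branch and 52 on the other, while 416 or 768 give matching sizes. The actual crop size used in the failing run is not shown in the thread, so 400 is only an example; the practical takeaway is to pick crop and image sizes that are multiples of 32.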

CoinCheung · Answer 2 · Mon Aug 07 2023 18:35:12 GMT+0800 (China Standard Time)

I am closing this because the title and description look very ugly. Leave a new message if you want to discuss further.
