AssertionError: Default process group is not initialized
amiltonwong opened this issue
Hi, authors,
I got the following error after running the command python tools/train.py configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py:
2021-04-08 08:03:22,265 - mmseg - INFO - Loaded 2975 images
2021-04-08 08:03:24,275 - mmseg - INFO - Loaded 500 images
2021-04-08 08:03:24,276 - mmseg - INFO - Start running, host: root@milton-LabPC, work_dir: /media/root/mdata/data/code13/SETR/work_dirs/SETR_PUP_768x768_40k_cityscapes_bs_8
2021-04-08 08:03:24,276 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
File "tools/train.py", line 161, in <module>
main()
File "tools/train.py", line 150, in main
train_segmentor(
File "/media/root/mdata/data/code13/SETR/mmseg/apis/train.py", line 106, in train_segmentor
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 130, in run
iter_runner(iter_loaders[i], **kwargs)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 152, in train_step
losses = self(**data_batch)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/base.py", line 122, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 157, in forward_train
loss_decode = self._decode_head_forward_train(x, img_metas,
File "/media/root/mdata/data/code13/SETR/mmseg/models/segmentors/encoder_decoder.py", line 100, in _decode_head_forward_train
loss_decode = self.decode_head.forward_train(x, img_metas,
File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/decode_head.py", line 185, in forward_train
seg_logits = self.forward(inputs)
File "/media/root/mdata/data/code13/SETR/mmseg/models/decode_heads/vit_up_head.py", line 93, in forward
x = self.syncbn_fc_0(x)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
return _get_group_size(group)
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/root/anaconda3/envs/pytorch1.7.0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
assert _default_pg is not None, \
AssertionError: Default process group is not initialized
(pytorch1.7.0) root@milton-LabPC:/data/code13/SETR
Since I am training on a single GPU, the error seems to be related to distributed training. Any hints on how to solve this issue?
Thanks!
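For what it's worth, the failure can be reproduced outside of SETR. Here is a minimal sketch of my own (assuming a CUDA device and PyTorch 1.7, matching the trace above): in training mode, SyncBatchNorm's forward pass queries the default process group via torch.distributed.get_world_size, so it fails whenever init_process_group has not been called.

```python
import torch
import torch.nn as nn

# Standalone repro of the assertion above (not SETR code):
# SyncBatchNorm needs the default process group during training.
bn = nn.SyncBatchNorm(8).cuda().train()
x = torch.randn(2, 8, 4, 4, device='cuda')
bn(x)  # AssertionError: Default process group is not initialized
```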
This is due to SyncBN, which requires an initialized distributed process group. Try launching through the distributed script instead:
./tools/dist_train.sh configs/SETR/SETR_PUP_768x768_40k_cityscapes_bs_8.py 1
This initializes a process group of world size 1, which satisfies the assertion, but it won't help you actually train SETR with only one GPU.
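If you only need to smoke-test the code on one GPU, another common workaround (a sketch of my own, not something this repo ships; newer mmcv releases provide a similar revert_sync_batchnorm helper) is to swap the SyncBatchNorm layers for plain BatchNorm2d before training:

```python
import torch.nn as nn

def revert_sync_batchnorm(module: nn.Module) -> nn.Module:
    """Recursively replace nn.SyncBatchNorm with nn.BatchNorm2d,
    carrying over affine parameters and running statistics.
    (Hypothetical helper, for single-GPU debugging only.)"""
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(module.num_features, module.eps,
                                   module.momentum, module.affine,
                                   module.track_running_stats)
        if module.affine:
            converted.weight = module.weight
            converted.bias = module.bias
        converted.running_mean = module.running_mean
        converted.running_var = module.running_var
        converted.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        converted.add_module(name, revert_sync_batchnorm(child))
    return converted
```

If every norm layer in this codebase is built from the config's norm_cfg, changing type='SyncBN' to type='BN' there may achieve the same thing, though I haven't verified that for SETR.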
Does that mean two or more GPUs are required to run training?
Yes. I haven't seen any modern segmentation model that can be trained on a single GPU; see mmsegmentation.