multi-gpu failed in train_localization.py
seyedmajidazimi opened this issue
Unlike multi-GPU training with "train.py", localization training with "train_localization.py" fails with the following error when I try to use multiple GPUs. It works on a single GPU, but I run out of memory with only one GeForce RTX 2080 Ti.
CUDA_VISIBLE_DEVICES=3,4 python train_localization.py --folds-csv folds.csv --config configs/se50_loc.json --logdir logs --predictions predictions --data-dir /datasets/xView2/train_tier3_combined --gpu 3,4 --output-dir weights
bottleneck 1280 256
bottleneck 704 192
bottleneck 384 128
bottleneck 128 64
Selected optimization level O0: Pure FP32 training.
Defaults for this optimization level are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : 1.0
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O0
cast_model_type : torch.float32
patch_torch_functions : False
keep_batchnorm_fp32 : None
master_weights : False
loss_scale : dynamic
Freezing encoder!!!
0%| | 0/1266 [00:09<?, ?it/s]
Traceback (most recent call last):
File "train_localization.py", line 302, in <module>
main()
File "train_localization.py", line 191, in main
args.local_rank)
File "train_localization.py", line 268, in train_epoch
out_mask = model(imgs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/apex/amp/_initialize.py", line 197, in new_fwd
**applier(kwargs, input_caster))
File "/space/export/data/azim_se/xView2_second_place/models/unet.py", line 390, in forward
x = stage(x)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/amajid/anaconda3/envs/xV2-con36-cu90-tch110-tv030/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)
The DDP version works. In your case, run it with the following parameters:
CUDA_VISIBLE_DEVICES=3,4 python -u -m torch.distributed.launch --nproc_per_node=2 --master_port 9901 train_localization.py --distributed --folds-csv folds.csv --config configs/se50_loc.json --logdir logs --predictions predictions --data-dir /datasets/xView2/train_tier3_combined --output-dir weights
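
The launch command above starts one process per GPU, so `--nproc_per_node` would typically match the number of ids listed in `CUDA_VISIBLE_DEVICES`. Here is a minimal sketch of keeping the two in sync when switching GPU sets. `build_launch_command` is a hypothetical helper, not part of the repository. It only assembles a command string from the flags already shown above and does not launch anything.

```python
import shlex


def build_launch_command(script, script_args, visible_devices, master_port=9901):
    """Hypothetical helper: build a torch.distributed.launch command line
    that starts one process per GPU id listed in visible_devices."""
    # Parse a comma-separated id list such as "3,4", ignoring blanks.
    devices = [d.strip() for d in visible_devices.split(",") if d.strip()]
    if not devices:
        raise ValueError("no GPU ids given")

    cmd = [
        "python", "-u", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(devices)}",  # one process per visible GPU
        "--master_port", str(master_port),
        script, "--distributed", *script_args,
    ]
    # Quote each argument so paths with spaces survive a copy-paste into a shell.
    quoted = " ".join(shlex.quote(part) for part in cmd)
    return f"CUDA_VISIBLE_DEVICES={','.join(devices)} {quoted}"


if __name__ == "__main__":
    print(build_launch_command(
        "train_localization.py",
        ["--folds-csv", "folds.csv", "--config", "configs/se50_loc.json"],
        "3,4",
    ))
```

With `"3,4"` this gives `--nproc_per_node=2`, matching the command above. Passing three ids would give `--nproc_per_node=3`.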