How to use DataParallelCriterion and DataParallelModel
lxtGH opened this issue
I followed the instructions on this page in my code. However, I got this error:
```
Traceback (most recent call last):
  File "train_psp_resnet101.py", line 232, in <module>
    train(cfg)
  File "train_psp_resnet101.py", line 124, in train
    loss = loss_fn(outputs, labels)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, float alpha)
 * (float other, float alpha)
```
outputs is a list where each item is a small tensor on a different GPU; labels is not a list but a single tensor on gpu:0.
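To see why the worker call is sensitive to the types of its arguments, here is a plain-Python sketch (with a hypothetical `toy_criterion`, not the encoding package) of what `module(*(input + target), **kwargs)` expects: both sides must be tuples so that `+` concatenates and `*` unpacks.

```python
def toy_criterion(prediction, label):
    # Hypothetical stand-in for a replicated loss module on one GPU.
    return abs(prediction - label)

input = (10,)   # per-GPU model output, wrapped in a tuple
target = (7,)   # per-GPU target, also wrapped in a tuple

# tuple + tuple concatenates, and * unpacks the result into the call:
# toy_criterion(*(10, 7)) is toy_criterion(10, 7)
output = toy_criterion(*(input + target))
print(output)  # 3

# If either side is a bare tensor instead of a tuple, `input + target`
# becomes tensor addition with a tuple -- the add() TypeError above.
```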
@zhanghang1989
This issue still persists:
```
  File "train.py", line 495, in train
    main_los = criterion(main_loss, gts)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 468, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, Number alpha)
 * (Number other, Number alpha)
```
Same problem: outputs is a list where each item is a small tensor on a different GPU; labels is not a list but a single tensor on gpu:0.
@AssassinCroc Making the output a tuple can solve this problem.
@lxtGH How do I make the output a tuple? Could you show me the exact solution?
I guess you mean doing something like this: replace the output of the network as below.

Before:

```python
return [x_dsn, x]
```

After:

```python
return (x_dsn, x)
```
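One way to see why the tuple form is safer, sketched in plain Python with hypothetical placeholder values (not the encoding package): the parallel criterion concatenates each per-GPU output with its target tuple before unpacking, and `list + tuple` concatenation raises a TypeError while `tuple + tuple` does not.

```python
# Hypothetical stand-ins for one GPU's scattered output and target.
target = (7,)            # targets are scattered into tuples

# A model that returns a list: concatenation with the target tuple fails.
input_as_list = [10]
try:
    input_as_list + target       # list + tuple is not allowed in Python
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)  # True

# A model that returns a tuple: concatenation works as intended.
input_as_tuple = (10,)
combined = input_as_tuple + target
print(combined)  # (10, 7) -- ready to be unpacked into the criterion
```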
@PkuRainBow Yes. Just `return x_dsn, x` can also work, since that already builds a tuple.
I met the same error with pytorch=0.4.1, and when I tried @lxtGH's solution it persisted. I found something wrong in parallel.py.
I have fixed this error with the following changes.

In `DataParallelCriterion.forward()`:

```python
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
targets = tuple(targets_per_gpu[0] for targets_per_gpu in targets)
outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
```
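What the added `targets` line does, sketched with hypothetical placeholder strings rather than real tensors: after scattering, each per-GPU target arrives as a one-element tuple, and the fix unwraps it so the worker receives the bare target.

```python
# Hypothetical scattered targets: one 1-tuple per GPU after scatter.
scattered_targets = (("target_gpu0",), ("target_gpu1",))

# The fix unwraps each 1-tuple so the worker gets the bare target.
targets = tuple(targets_per_gpu[0] for targets_per_gpu in scattered_targets)
print(targets)  # ('target_gpu0', 'target_gpu1')
```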
In `_criterion_parallel_apply()`:

```python
try:
    with torch.cuda.device(device):
        output = module(input, target)
    with lock:
        results[i] = output
except Exception as e:
```
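The worker change swaps tuple concatenation for a direct two-argument call. In plain Python (a toy stand-in, not the real criterion module) the difference looks like this:

```python
def toy_criterion(prediction, label):
    # Hypothetical stand-in for a loss module taking (prediction, label).
    return abs(prediction - label)

# Original worker: module(*(input + target)) -- both must be tuples.
output_old = toy_criterion(*((10,) + (7,)))

# Patched worker: module(input, target) -- bare per-GPU values are fine.
output_new = toy_criterion(10, 7)

print(output_old, output_new)  # 3 3
```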
It works for me, hope it helps!