How to use DataParallelCriterion and DataParallelModel
lxtGH opened this issue
I followed the instructions on this page in my code. However, I got this error:
```
Traceback (most recent call last):
  File "train_psp_resnet101.py", line 232, in <module>
    train(cfg)
  File "train_psp_resnet101.py", line 124, in train
    loss = loss_fn(outputs, labels)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, float alpha)
 * (float other, float alpha)
```
outputs is a list where each item is a small tensor on a different GPU; labels is not a list but a single tensor on gpu:0.
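To see why the worker call is sensitive to the types of its arguments, here is a plain-Python sketch (with a hypothetical `toy_criterion`, not the encoding package) of what `module(*(input + target), **kwargs)` expects: both sides must be tuples so that `+` concatenates and `*` unpacks.

```python
def toy_criterion(prediction, label):
    # Hypothetical stand-in for a replicated loss module on one GPU.
    return abs(prediction - label)

input = (10,)   # per-GPU model output, wrapped in a tuple
target = (7,)   # per-GPU target, also wrapped in a tuple

# tuple + tuple concatenates, and * unpacks the result into the call:
# toy_criterion(*(10, 7)) is toy_criterion(10, 7)
output = toy_criterion(*(input + target))
print(output)  # 3

# If either side is a bare tensor instead of a tuple, `input + target`
# becomes tensor addition with a tuple -- the add() TypeError above.
```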
@zhanghang1989
This issue still persists:
```
  File "train.py", line 495, in train
    main_los = criterion(main_loss, gts)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 468, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, Number alpha)
 * (Number other, Number alpha)
```
Same problem: outputs is a list where each item is a small tensor on a different GPU; labels is not a list but a single tensor on gpu:0.
@AssassinCroc Making the output a tuple can solve this problem.
@lxtGH How do I make the output a tuple? Could you show me the exact solution?
I guess you mean doing something like this: replace the output of the network as below.

Before:

```python
return [x_dsn, x]
```

After:

```python
return (x_dsn, x)
```
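One way to see why the tuple form is safer, sketched in plain Python with hypothetical placeholder values (not the encoding package): the parallel criterion concatenates each per-GPU output with its target tuple before unpacking, and `list + tuple` concatenation raises a TypeError while `tuple + tuple` does not.

```python
# Hypothetical stand-ins for one GPU's scattered output and target.
target = (7,)            # targets are scattered into tuples

# A model that returns a list: concatenation with the target tuple fails.
input_as_list = [10]
try:
    input_as_list + target       # list + tuple is not allowed in Python
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)  # True

# A model that returns a tuple: concatenation works as intended.
input_as_tuple = (10,)
combined = input_as_tuple + target
print(combined)  # (10, 7) -- ready to be unpacked into the criterion
```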
@PkuRainBow Yes. Just `return x_dsn, x` can also work, since that already builds a tuple.
I met the same error with pytorch=0.4.1, and when I tried @lxtGH's solution it persisted. I found something wrong in parallel.py.
I have fixed this error with the following changes.

In `DataParallelCriterion.forward()`:

```python
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
targets = tuple(targets_per_gpu[0] for targets_per_gpu in targets)
outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
```
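What the added `targets` line does, sketched with hypothetical placeholder strings rather than real tensors: after scattering, each per-GPU target arrives as a one-element tuple, and the fix unwraps it so the worker receives the bare target.

```python
# Hypothetical scattered targets: one 1-tuple per GPU after scatter.
scattered_targets = (("target_gpu0",), ("target_gpu1",))

# The fix unwraps each 1-tuple so the worker gets the bare target.
targets = tuple(targets_per_gpu[0] for targets_per_gpu in scattered_targets)
print(targets)  # ('target_gpu0', 'target_gpu1')
```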
In `_criterion_parallel_apply()`:

```python
try:
    with torch.cuda.device(device):
        output = module(input, target)
    with lock:
        results[i] = output
except Exception as e:
```
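The worker change swaps tuple concatenation for a direct two-argument call. In plain Python (a toy stand-in, not the real criterion module) the difference looks like this:

```python
def toy_criterion(prediction, label):
    # Hypothetical stand-in for a loss module taking (prediction, label).
    return abs(prediction - label)

# Original worker: module(*(input + target)) -- both must be tuples.
output_old = toy_criterion(*((10,) + (7,)))

# Patched worker: module(input, target) -- bare per-GPU values are fine.
output_new = toy_criterion(10, 7)

print(output_old, output_new)  # 3 3
```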
It works for me, hope it helps!