zhanghang1989 / PyTorch-Encoding

A CV toolkit for my papers.

Home Page: https://hangzhang.org/PyTorch-Encoding/

How to use DataParallelCriterion and DataParallelModel

lxtGH opened this issue · comments

I followed the instructions on this page in my code.
However, I got this error:
Traceback (most recent call last):
  File "train_psp_resnet101.py", line 232, in <module>
    train(cfg)
  File "train_psp_resnet101.py", line 124, in train
    loss = loss_fn(outputs, labels)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/home/xiangtai/anaconda3/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, float alpha)
 * (float other, float alpha)

outputs is a list in which each item is a small tensor (each tensor is on a different GPU); labels is not a list, but a single tensor on gpu:0.
@zhanghang1989

This issue still persists:
  File "train.py", line 495, in train
    main_los = criterion(main_loss, gts)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 468, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 134, in forward
    outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 188, in _criterion_parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 163, in _worker
    output = module(*(input + target), **kwargs)
TypeError: add() received an invalid combination of arguments - got (tuple), but expected one of:
 * (Tensor other, Number alpha)
 * (Number other, Number alpha)

Same problem: outputs is a list in which each item is a small tensor (each tensor is on a different GPU); labels is not a list, but a single tensor on gpu:0.

@AssassinCroc Making the network output a tuple solves this problem.

@lxtGH How do I make the output a tuple? Could you show me the exact solution?

I guess you mean changing the network's return value as follows.

Before:

return [x_dsn, x]

After:

return (x_dsn, x)

@PkuRainBow Yes. Writing just return x_dsn, x also works.
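
One possible reading of why the tuple return helps, assuming the worker line in the traceback is output = module(*(input + target), **kwargs) (the asterisk appears to have been lost in formatting): the per-GPU output and the per-GPU target are joined with + before being unpacked into the criterion. The sketch below uses plain Python strings in place of tensors; call_criterion and toy_criterion are hypothetical names for illustration and are not part of PyTorch-Encoding.

```python
def toy_criterion(*args):
    """Stand-in for a loss module: just reports what it was called with."""
    return args


def call_criterion(criterion, per_gpu_input, per_gpu_target):
    """Mimic the assumed worker call: criterion(*(input + target))."""
    return criterion(*(per_gpu_input + per_gpu_target))


# Strings stand in for tensors living on one GPU.
target = ("labels_gpu0",)

# Network returns a tuple: tuple + tuple concatenates cleanly.
ok = call_criterion(toy_criterion, ("x_dsn_gpu0", "x_gpu0"), target)

# Network returns a list: list + tuple is rejected by Python itself.
try:
    call_criterion(toy_criterion, ["x_dsn_gpu0", "x_gpu0"], target)
    list_failed = False
except TypeError:
    list_failed = True
```

With a bare tensor on the left-hand side, + would be handed to the tensor's own add, which is consistent with the add() ... got (tuple) message in the tracebacks above.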

I hit the same error with pytorch=0.4.1. I tried @lxtGH's solution, but the error persists.
I also found that something is wrong in parallel.py.

I fixed this error with the following changes:

In DataParallelCriterion.forward()

replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
targets = tuple(targets_per_gpu[0] for targets_per_gpu in targets)
outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)

In _criterion_parallel_apply()

try:
    with torch.cuda.device(device):
        output = module(input, target)
    with lock:
        results[i] = output
except Exception as e:

It works for me, hope it helps!
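
The change above can be read as two steps: unwrap each per-GPU target from its one-element container, then pass output and target to the criterion as two separate arguments instead of concatenating them. Below is a minimal pure-Python sketch of that idea, assuming the scattered targets arrive as one single-element tuple per GPU (an assumption about the data layout, not something confirmed in this thread). Strings stand in for tensors, and unwrap_targets and toy_criterion are hypothetical names.

```python
def unwrap_targets(scattered_targets):
    """Take the first element of each per-GPU container, as in the fix above."""
    return tuple(targets_per_gpu[0] for targets_per_gpu in scattered_targets)


def toy_criterion(output, target):
    """Stand-in for a loss module taking exactly (output, target)."""
    return (output, target)


# Assumed shape after scattering: one 1-tuple of labels per GPU.
scattered = (("labels_gpu0",), ("labels_gpu1",))
outputs = ["out_gpu0", "out_gpu1"]  # one network output per GPU

targets = unwrap_targets(scattered)

# Each replica is called as module(input, target), with no `+` involved.
results = [toy_criterion(o, t) for o, t in zip(outputs, targets)]
```

Note that calling module(input, target) this way suits a criterion that takes one output and one target. If the network returns several outputs such as (x_dsn, x), the criterion would receive that whole tuple as its first argument and would need to handle it.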