kaituoxu / Speech-Transformer

A PyTorch implementation of Speech Transformer, an end-to-end ASR model built on the Transformer network, for Mandarin Chinese.


Why does changing the ngpu value in run.sh have no effect when training on multiple GPUs?

counter0 opened this issue

Why does changing the ngpu value in run.sh have no effect when training on multiple GPUs?
  1. ngpu defaults to 1, and changing it to another number has no effect.
  2. Multi-GPU training is not supported yet; if you need it, you can use PyTorch's nn.DataParallel() (a minimal sketch follows below).
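For reference, a minimal sketch of wrapping the model in nn.DataParallel (illustrative only; `build_model()` is a placeholder rather than a function in this repo, and the inputs are assumed to be batch-first of shape (batch, T, D) as in the traceback further down):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: build_model() stands in for however the Transformer
# is constructed in src/bin/train.py; it is not part of this repo's API.
model = build_model()

if torch.cuda.device_count() > 1:
    # dim=0 splits the batch dimension, which matches batch-first inputs
    # of shape (batch, T, D).
    model = nn.DataParallel(model, dim=0)
model = model.cuda()

# Training then calls the wrapped model exactly as before; DataParallel
# scatters each batch across the visible GPUs and gathers the outputs
# back onto the default device.
pred, gold = model(padded_input, input_lengths, padded_target)
```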

After adding DataParallel the way it's described online, I get the error "Gather got an input of invalid size: got [10,35,9961], but expected [10,26,9961]". I couldn't find what's wrong. Do you know roughly what might be causing this?

@chenpe32cp Did you set batch_first?

Correction: I actually meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel
I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when a batch is split across GPUs, each GPU ends up with a different max_len, so the error occurs when the outputs are gathered back together. I haven't found a solution yet, unfortunately...

When using two GPUs, I traced the bug to DataParallel splitting the batch of size 18 into two halves of 9 each, but the sequence lengths on the two GPUs are different (35 vs. 27). How can this be resolved?
padded_input's shape: torch.Size([18, 292, 320])
input_lengths's shape: torch.Size([18])
padded_target's shape: torch.Size([18, 34])
max_len: 35
pad: torch.Size([9, 35])
max_len: 35
pad: torch.Size([9, 35])
max_len: 27
pad: torch.Size([9, 27])
max_len: 27
pad: torch.Size([9, 27])
Traceback (most recent call last):
File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 168, in
main(args)
File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 162, in main
solver.train()
File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 82, in train
tr_avg_loss = self._run_one_epoch(epoch)
File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 169, in _run_one_epoch
pred, gold = self.model(padded_input, input_lengths, padded_target)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [9, 35, 9961], but expected [9, 27, 9961] (gather at torch/csrc/cuda/comm.cpp:183)
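For what it's worth, the error above can be reproduced with a toy module whose output length depends only on the shard of data it receives, which seems to be exactly what happens when the decoder re-pads targets to the per-shard max_len. A self-contained sketch (needs at least two GPUs; not this repo's code):

```python
import torch
import torch.nn as nn

class VarLenHead(nn.Module):
    """Toy stand-in for a decoder that pads its output to the max length
    seen in *its own* shard of the batch."""
    def forward(self, lengths):
        max_len = int(lengths.max())                      # per-replica max
        return torch.zeros(lengths.size(0), max_len, 10,
                           device=lengths.device)

model = nn.DataParallel(VarLenHead()).cuda()

# The first 9 utterances have length 35, the last 9 have length 27, so
# replica 0 returns (9, 35, 10) and replica 1 returns (9, 27, 10); the
# gather along dim=0 then fails with "Gather got an input of invalid size".
lengths = torch.cat([torch.full((9,), 35, dtype=torch.long),
                     torch.full((9,), 27, dtype=torch.long)]).cuda()
out = model(lengths)
```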

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.
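For anyone who wants to go the Horovod route, a minimal one-GPU-per-process training sketch might look like the following (illustrative only; `build_model()` and the sampler wiring are placeholders, not code from this repo):

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())   # pin each process to one GPU

model = build_model().cuda()              # placeholder for model construction
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers and start every worker from the same state.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each process should read its own shard of the data, e.g. with
# torch.utils.data.distributed.DistributedSampler(dataset,
#     num_replicas=hvd.size(), rank=hvd.rank()).
```

Launched with something like `horovodrun -np 2 python train.py`, every process owns one GPU and computes its own forward/backward pass, so the variable-length gather problem from DataParallel never arises.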

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.
Could you share your multi-GPU training code so I can use it as a reference? I've been debugging for a long time and still haven't gotten it working.

Correction: I actually meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel
I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when a batch is split across GPUs, each GPU ends up with a different max_len, so the error occurs when the outputs are gathered back together. I haven't found a solution yet, unfortunately...

Here is an example solution: https://github.com/kaituoxu/Listen-Attend-Spell/blob/master/src/models/encoder.py#L34-L42
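The linked example appears to use the total_length argument of pad_packed_sequence, the fix recommended in the PyTorch FAQ for RNNs under DataParallel: record the time dimension of the globally padded batch before packing, so every replica pads its output back to the same length and the gather sizes match. A hedged sketch of that pattern (names are illustrative, not the repo's exact code):

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class RNNEncoder(nn.Module):
    """Sketch of the total_length trick that keeps RNN outputs the same
    length on every DataParallel replica."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, padded_input, input_lengths):
        # Length of the *globally* padded batch, recorded before packing.
        total_length = padded_input.size(1)
        # enforce_sorted=False needs PyTorch >= 1.1; lengths must live on CPU.
        packed = pack_padded_sequence(padded_input, input_lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        packed_output, _ = self.rnn(packed)
        # total_length forces every replica to pad back to the same T,
        # so DataParallel can gather the outputs.
        output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                        total_length=total_length)
        return output
```

The analogous idea for the Transformer decoder here would be to pad each replica's predictions up to a length derived from the already globally padded padded_target (its size(1)) rather than the per-shard maximum, so all replicas return matching shapes.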

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.

Could you please share the code?