kaituoxu / Speech-Transformer

A PyTorch implementation of Speech Transformer, an end-to-end ASR model built on the Transformer network, for Mandarin Chinese.


Why does changing the ngpu value in run.sh have no effect when training on multiple GPUs?

counter0 opened this issue

Why does changing the ngpu value in run.sh have no effect when training on multiple GPUs?
  1. ngpu defaults to 1, and changing it to another number has no effect.
  2. Multi-GPU training is not supported yet; if you need it, you can use PyTorch's nn.DataParallel() (a minimal sketch follows below).
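For reference, a minimal sketch of wrapping the model in nn.DataParallel (illustrative only; `build_model()` is a placeholder rather than a function in this repo, and the inputs are assumed to be batch-first of shape (batch, T, D) as in the traceback further down):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: build_model() stands in for however the Transformer
# is constructed in src/bin/train.py; it is not part of this repo's API.
model = build_model()

if torch.cuda.device_count() > 1:
    # dim=0 splits the batch dimension, which matches batch-first inputs
    # of shape (batch, T, D).
    model = nn.DataParallel(model, dim=0)
model = model.cuda()

# Training then calls the wrapped model exactly as before; DataParallel
# scatters each batch across the visible GPUs and gathers the outputs
# back onto the default device.
pred, gold = model(padded_input, input_lengths, padded_target)
```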

After adding DataParallel the way it's described online, I get the error "Gather got an input of invalid size: got [10,35,9961], but expected [10,26,9961]". I couldn't find what's wrong. Do you know roughly what might be causing this?

@chenpe32cp Did you set batch_first?

Correction: I actually meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel
I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when a batch is split across GPUs, each GPU ends up with a different max_len, so the error occurs when the outputs are gathered back together. I haven't found a solution yet, unfortunately...

When using two GPUs, I traced the bug to DataParallel splitting the batch of size 18 into two halves of 9 each, but the sequence lengths on the two GPUs are different (35 vs. 27). How can this be resolved?
padded_input's shape: torch.Size([18, 292, 320])
input_lengths's shape: torch.Size([18])
padded_target's shape: torch.Size([18, 34])
max_len: 35
pad: torch.Size([9, 35])
max_len: 35
pad: torch.Size([9, 35])
max_len: 27
pad: torch.Size([9, 27])
max_len: 27
pad: torch.Size([9, 27])
Traceback (most recent call last):
File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 168, in
main(args)
File "/home/cp/Speech-Transformer-master/egs/aishell/../../src/bin/train.py", line 162, in main
solver.train()
File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 82, in train
tr_avg_loss = self._run_one_epoch(epoch)
File "/home/cp/Speech-Transformer-master/src/solver/solver.py", line 169, in _run_one_epoch
pred, gold = self.model(padded_input, input_lengths, padded_target)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.gather(outputs, self.output_device)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 136, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
return gather_map(outputs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 54, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 65, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/cp/application/miniconda3/lib/python3.6/site-packages/torch/cuda/comm.py", line 160, in gather
return torch._C._gather(tensors, dim, destination)
RuntimeError: Gather got an input of invalid size: got [9, 35, 9961], but expected [9, 27, 9961] (gather at torch/csrc/cuda/comm.cpp:183)
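For what it's worth, the error above can be reproduced with a toy module whose output length depends only on the shard of data it receives, which seems to be exactly what happens when the decoder re-pads targets to the per-shard max_len. A self-contained sketch (needs at least two GPUs; not this repo's code):

```python
import torch
import torch.nn as nn

class VarLenHead(nn.Module):
    """Toy stand-in for a decoder that pads its output to the max length
    seen in *its own* shard of the batch."""
    def forward(self, lengths):
        max_len = int(lengths.max())                      # per-replica max
        return torch.zeros(lengths.size(0), max_len, 10,
                           device=lengths.device)

model = nn.DataParallel(VarLenHead()).cuda()

# The first 9 utterances have length 35, the last 9 have length 27, so
# replica 0 returns (9, 35, 10) and replica 1 returns (9, 27, 10); the
# gather along dim=0 then fails with "Gather got an input of invalid size".
lengths = torch.cat([torch.full((9,), 35, dtype=torch.long),
                     torch.full((9,), 27, dtype=torch.long)]).cuda()
out = model(lengths)
```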

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.
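For anyone who wants to go the Horovod route, a minimal one-GPU-per-process training sketch might look like the following (illustrative only; `build_model()` and the sampler wiring are placeholders, not code from this repo):

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())   # pin each process to one GPU

model = build_model().cuda()              # placeholder for model construction
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across workers and start every worker from the same state.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each process should read its own shard of the data, e.g. with
# torch.utils.data.distributed.DistributedSampler(dataset,
#     num_replicas=hvd.size(), rank=hvd.rank()).
```

Launched with something like `horovodrun -np 2 python train.py`, every process owns one GPU and computes its own forward/backward pass, so the variable-length gather problem from DataParallel never arises.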

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.
Could you share your multi-GPU training code so I can use it as a reference? I've been debugging for a long time and still haven't gotten it working.

Correction: I actually meant the dim parameter of DataParallel: https://pytorch.org/docs/1.0.0/nn.html?highlight=dataparallel#torch.nn.DataParallel
I did try setting that parameter to 1, but the input has shape (batch_size, T, D), which corresponds to dim=0 in DataParallel. My current suspicion is that when a batch is split across GPUs, each GPU ends up with a different max_len, so the error occurs when the outputs are gathered back together. I haven't found a solution yet, unfortunately...

Here is an example solution: https://github.com/kaituoxu/Listen-Attend-Spell/blob/master/src/models/encoder.py#L34-L42
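The linked example appears to use the total_length argument of pad_packed_sequence, the fix recommended in the PyTorch FAQ for RNNs under DataParallel: record the time dimension of the globally padded batch before packing, so every replica pads its output back to the same length and the gather sizes match. A hedged sketch of that pattern (names are illustrative, not the repo's exact code):

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class RNNEncoder(nn.Module):
    """Sketch of the total_length trick that keeps RNN outputs the same
    length on every DataParallel replica."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.rnn = nn.LSTM(input_size, hidden_size, batch_first=True)

    def forward(self, padded_input, input_lengths):
        # Length of the *globally* padded batch, recorded before packing.
        total_length = padded_input.size(1)
        # enforce_sorted=False needs PyTorch >= 1.1; lengths must live on CPU.
        packed = pack_padded_sequence(padded_input, input_lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        packed_output, _ = self.rnn(packed)
        # total_length forces every replica to pad back to the same T,
        # so DataParallel can gather the outputs.
        output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                        total_length=total_length)
        return output
```

The analogous idea for the Transformer decoder here would be to pad each replica's predictions up to a length derived from the already globally padded padded_target (its size(1)) rather than the per-shard maximum, so all replicas return matching shapes.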

I used a different approach for multi-GPU training: Horovod. You could look into it; the required code changes are small.

Could you please share the code?