1ytic / warp-rnnt

CUDA-Warp RNN-Transducer

Strange behavior using PyTorch DDP

snakers4 opened this issue · comments

@1ytic
Hi,

So far I have been able to use the loss with DDP on a single GPU, and it behaves more or less as expected.

But when I use more than 1 device, the following happens:

  • On GPU-0 loss is calculated properly
  • On GPU-1 loss is close to zero for each batch

I checked the input tensors, devices, tensor values, etc., and so far everything seems to be identical between GPU-0 and the other GPUs.
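For reference, a minimal per-rank diagnostic for this kind of issue might look like the sketch below. It assumes the warp_rnnt.rnnt_loss signature from this repo's README (log-probs, labels, frame lengths, label lengths); the tensor names are hypothetical.

```python
import torch.distributed as dist
import warp_rnnt


def debug_rnnt_loss(log_probs, labels, frames_lengths, labels_lengths):
    # Print which device each input lives on and the loss this rank computes,
    # so per-rank discrepancies (e.g. GPU-1 returning ~0) become visible.
    rank = dist.get_rank() if dist.is_initialized() else 0
    for name, t in (("log_probs", log_probs), ("labels", labels),
                    ("frames_lengths", frames_lengths), ("labels_lengths", labels_lengths)):
        print(f"rank {rank}: {name} device={t.device} dtype={t.dtype} shape={tuple(t.shape)}")
    # log_probs are expected to be the joiner output after log_softmax
    losses = warp_rnnt.rnnt_loss(log_probs, labels, frames_lengths, labels_lengths)
    loss = losses.mean()
    print(f"rank {rank}: loss={loss.item():.4f}")
    return loss
```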

@burchim
By the way, since you used this loss, did you encounter anything of this sort in your work?

Hi @snakers4!
Yes, I had a similar problem with 4 GPU devices, where the RNN-T loss was computed properly on the first devices but was 0 on the others. I don't really remember what the exact cause was, but it had something to do with tensor devices. Maybe the frame / label lengths.
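If it was indeed the length tensors ending up on the wrong device, a guard along these lines might be enough (a sketch only; variable and function names are hypothetical):

```python
import warp_rnnt


def rnnt_loss_on_local_device(log_probs, labels, frames_lengths, labels_lengths):
    # Move the labels and both length tensors onto the same device as the
    # log-probs (i.e. this rank's GPU) before computing the loss.
    device = log_probs.device
    losses = warp_rnnt.rnnt_loss(log_probs,
                                 labels.to(device),
                                 frames_lengths.to(device),
                                 labels_lengths.to(device))
    return losses.mean()
```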

I also recently experimented with replacing it with the official torchaudio.transforms.RNNTLoss from torchaudio 0.10.0.
It was working very well, but I didn't try a full training with it.
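For anyone reading along, a drop-in usage of that loss might look roughly like the sketch below (made-up shapes). Note two differences from warp-rnnt assumed here: torchaudio takes raw joiner logits rather than log-probs, and its default blank index is -1 (the last class), so it is set to 0 explicitly.

```python
import torch
import torchaudio

# Joiner output as raw logits: (batch, time, target_len + 1, vocab).
batch, time, target_len, vocab = 2, 10, 5, 32
logits = torch.randn(batch, time, target_len + 1, vocab, requires_grad=True)
targets = torch.randint(1, vocab, (batch, target_len), dtype=torch.int32)  # no blank (0) in targets
logit_lengths = torch.tensor([10, 8], dtype=torch.int32)
target_lengths = torch.tensor([5, 3], dtype=torch.int32)

rnnt_loss = torchaudio.transforms.RNNTLoss(blank=0, reduction="mean")
loss = rnnt_loss(logits, targets, logit_lengths, target_lengths)
loss.backward()
print(loss.item())
```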

Thanks for the heads-up about the torchaudio loss!
I remember seeing it some time ago, but I had totally forgotten about it.

@burchim
By the way, did you get RuntimeError: input length mismatch when migrating from warp-rnnt to torchaudio?

Yes, this means that the logit / target length tensors do not match the logit / target tensors, for instance if a logit length is longer than the time dimension of your logits tensor.

It was because I used the target lengths instead of the logit lengths, a stupid error.
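For reference, that mismatch can be caught before the call with a couple of assertions. A sketch, assuming the four-argument order of the torchaudio loss above (the function name is hypothetical):

```python
def check_rnnt_inputs(logits, targets, logit_lengths, target_lengths):
    # The third argument must hold the per-utterance logit (time) lengths and
    # the fourth the target lengths; swapping them, or passing lengths larger
    # than the corresponding tensor dimension, triggers the mismatch error.
    batch, max_time, max_target_plus_1, _ = logits.shape
    assert targets.shape == (batch, max_target_plus_1 - 1)
    assert int(logit_lengths.max()) <= max_time, "logit length exceeds logits time dimension"
    assert int(target_lengths.max()) <= max_target_plus_1 - 1, "target length exceeds targets dimension"
```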

@snakers4
You may find https://github.com/danpovey/fast_rnnt useful.