asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.

A multi-GPU bug

DingJuPeng1 opened this issue

If I use batch_bins in ESPnet, it triggers a multi-GPU bug.
For example, with two GPUs and DataParallel, a final batch_size of 61 gets split into 30 and 31.
When I then use torch-audiomentations, it triggers the error below.
(screenshot of the error traceback)

Does the batch size on each card have to be the same, or is there another way to avoid this bug?

Looking forward to a reply.
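
For reference, a minimal sketch of the uneven split (assuming DataParallel chunks the batch along dim 0 the way torch.chunk does; this is not the actual ESPnet data path):

```python
# Minimal sketch: how an uneven batch gets split across two GPUs.
# torch.chunk along dim 0 is roughly what DataParallel's scatter does.
import torch

batch = torch.randn(61, 1, 16000)  # (batch, channels, time), dummy audio
sub_batches = torch.chunk(batch, 2, dim=0)
print([s.shape[0] for s in sub_batches])  # [31, 30]
```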

Could you provide a snippet of code that reproduces the problem? If possible, make one that doesn't need a multi-GPU setup to reproduce it, as I don't have such a setup available at the moment.

I apologize that it isn't easy to provide a small piece of reproducible code, because this code is built on ESPnet, which wraps everything.
When I run the same code on a single GPU it works normally, but with two GPUs and DataParallel it triggers this problem. It seems to mix up the sample sizes of the two sub-batches, which leads to a shape mismatch.
The problem occurs when I run this code:
(screenshot of the code that applies self.transforms)
"self.transforms" include "ApplyImpulseResponse" and "AddBackgroundNoise"
(screenshot of the transform setup)
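
In rough terms, the setup looks like the following hypothetical sketch (not the actual ESPnet code; the transforms are assumed to run inside a module's forward pass under nn.DataParallel, and the IR and noise directories are placeholders):

```python
# Hypothetical reproduction sketch; directories below are placeholders.
import torch
import torch.nn as nn
from torch_audiomentations import Compose, ApplyImpulseResponse, AddBackgroundNoise

class AugmentFrontend(nn.Module):
    def __init__(self):
        super().__init__()
        self.transforms = Compose(
            transforms=[
                ApplyImpulseResponse("path/to/impulse_responses", p=1.0),
                AddBackgroundNoise("path/to/background_noises", p=1.0),
            ]
        )

    def forward(self, samples):
        # samples: (batch, channels, time)
        return self.transforms(samples, sample_rate=16000)

model = nn.DataParallel(AugmentFrontend().cuda())  # e.g. 2 visible GPUs
audio = torch.randn(61, 1, 16000).cuda()  # 61 not divisible by 2 -> split into 31 + 30
out = model(audio)  # reportedly fails with a shape mismatch in this situation
```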

OK, but without a code example and a multi-GPU setup I won't be able to reproduce the bug at the moment.

Does this bug apply to all transforms, or just ApplyImpulseResponse and/or AddBackgroundNoise?

Is there a way you can work around it? Like always giving it a batch size that is divisible by your number of GPUs? A fixed batch size should do the trick (see the sketch at the end of this comment).

Or would you like to make a PR that fixes the bug?

Or maybe I should add a known limitation to the README saying that multi-GPU with "uneven" batch sizes isn't officially supported?
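
For the fixed-batch-size workaround, a minimal sketch with a plain PyTorch DataLoader (ESPnet's own batch sampler would need its equivalent setting) could look like this:

```python
# Workaround sketch: keep every batch divisible by the GPU count so
# DataParallel never produces sub-batches of different sizes.
import torch
from torch.utils.data import DataLoader, TensorDataset

num_gpus = max(torch.cuda.device_count(), 1)
batch_size = 32 * num_gpus  # always divisible by the number of GPUs

dataset = TensorDataset(torch.randn(256, 1, 16000))  # dummy 1-second clips at 16 kHz
loader = DataLoader(
    dataset,
    batch_size=batch_size,
    drop_last=True,  # drop the ragged final batch instead of splitting it unevenly
)
```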

When I use batch sizes that are divisible by the number of GPUs, it works normally. The problem seems to occur only when the batch sizes differ across GPUs. If you can't reproduce and fix the bug, I think I can try to fix it myself first.