rishikksh20 / Fre-GAN-pytorch

Fre-GAN: Adversarial Frequency-consistent Audio Synthesis

Inconsistency with paper

CookiePPP opened this issue

https://arxiv.org/pdf/2106.02297.pdf

Section 2.3 of the paper says:
"After each level of DWT, all the frequency sub-bands are channel-wise concatenated and passed to convolutional layers"

But the discriminator code here does this:

if i == 0:
    x = torch.cat([x, x_d1], dim=2)
if i == 1:
    x = torch.cat([x, x_d2], dim=2)
i = i + 1

You are concatenating along the length dim (dim=2), which produces an odd-looking tensor where the first half is audio features and the second half is DWT features, so local waveform and DWT information can't mix properly.
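
To make the shape difference concrete, here is a minimal sketch in plain Python that applies the shape rule of `torch.cat` (every dimension must match except the one being concatenated, which is summed). The helper `cat_shape` and the example shapes (batch 1, 32 channels, 100 frames) are assumptions for illustration, not values taken from the repo.

```python
def cat_shape(a, b, dim):
    """Shape of concatenating shapes a and b along dim, following torch.cat's rule."""
    if len(a) != len(b):
        raise ValueError("rank mismatch")
    for axis, (m, n) in enumerate(zip(a, b)):
        if axis != dim and m != n:
            raise ValueError(f"size mismatch on dim {axis}: {m} vs {n}")
    out = list(a)
    out[dim] = a[dim] + b[dim]
    return tuple(out)

# Hypothetical (batch, channels, length) shapes, chosen so both concats are valid.
x = (1, 32, 100)      # conv features
x_d1 = (1, 32, 100)   # DWT features

length_wise = cat_shape(x, x_d1, dim=2)   # what the quoted code does
channel_wise = cat_shape(x, x_d1, dim=1)  # what "channel-wise" would suggest

print(length_wise)   # (1, 32, 200)
print(channel_wise)  # (1, 64, 100)
```

With dim=2, a convolution kernel sliding over the length axis sees only conv features or only DWT features, except around the seam at position 100. With dim=1, every time step carries both feature sets side by side.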

Is there a reason for this? It confuses me, but you've done it twice, so I assume it's intentional.

Hey @CookiePPP, as I understand it, the sentence "After each level of DWT, all the frequency sub-bands are channel-wise concatenated and passed to convolutional layers" refers to these lines:

# DWT 1
x_d1_high1, x_d1_low1 = self.dwt1d(x)
x_d1 = self.dwt_conv1(torch.cat([x_d1_high1, x_d1_low1], dim=1))
# DWT 2
x_d2_high1, x_d2_low1 = self.dwt1d(x_d1_high1)
x_d2_high2, x_d2_low2 = self.dwt1d(x_d1_low1)
x_d2 = self.dwt_conv2(torch.cat([x_d2_high1, x_d2_low1, x_d2_high2, x_d2_low2], dim=1))
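
The quoted lines can be illustrated with a small pure-Python sketch of a two-level decomposition. It assumes a Haar wavelet, which may differ from what the repo's `dwt1d` module computes. The function `haar_dwt` and the sample signal are hypothetical, and the (high, low) return order is chosen only to mirror the variable names above.

```python
import math

def haar_dwt(x):
    """One level of a Haar DWT on an even-length list. Returns (high, low)."""
    if len(x) % 2:
        raise ValueError("length must be even")
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return high, low

signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 4.0, 0.0]

# Level 1: two sub-bands, each half the input length.
high1, low1 = haar_dwt(signal)
level1 = [high1, low1]              # stacked as channels: 2 channels x 4 samples

# Level 2: decompose both level-1 bands, giving four sub-bands.
hh, hl = haar_dwt(high1)
lh, ll = haar_dwt(low1)
level2 = [hh, hl, lh, ll]           # stacked as channels: 4 channels x 2 samples

print(len(level1), len(level1[0]))  # 2 4
print(len(level2), len(level2[0]))  # 4 2
```

Each level halves the length and doubles the number of sub-bands, so stacking the sub-bands as channels (dim=1 in the quoted code) keeps them aligned in time.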

@leminhnguyen
Thanks!
Yes, I see. I wonder whether "and passed to convolutional layers" was meant to imply a channel-wise concat or a length-wise one. 🤔

I suppose the difference in quality shouldn't be large. It may just increase compute/training time, since the discriminator latents get longer after every concat.
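
As a rough check on the compute point, here is a back-of-the-envelope sketch that counts multiply-accumulates for a dense 1-D convolution as `c_in * c_out * kernel * length_out`. The layer sizes, kernel size, and stride-1 "same" padding are assumptions for illustration and do not come from the repo's discriminator.

```python
def conv1d_macs(c_in, c_out, kernel, length_out):
    """Multiply-accumulates of one dense Conv1d layer (groups=1, bias ignored)."""
    return c_in * c_out * kernel * length_out

# Hypothetical setup: features and DWT features are both 32 channels x 100 frames,
# followed by two stride-1, same-padded convs with kernel size 5.
C, T, K = 32, 100, 5

# Length-wise concat: channels stay 32, length doubles and stays doubled.
length_wise = conv1d_macs(C, 64, K, 2 * T) + conv1d_macs(64, 128, K, 2 * T)

# Channel-wise concat: the next conv takes 64 input channels, length stays 100.
channel_wise = conv1d_macs(2 * C, 64, K, T) + conv1d_macs(64, 128, K, T)

print(length_wise)   # 10240000
print(channel_wise)  # 6144000
```

Under these assumptions, the extra length is paid for in every later layer, while the extra channels only affect the layer right after the concat. Strided layers would change the exact numbers.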