seorim0 / NUNet-TLS

Nested U-Net with two-level skip connections for speech enhancement

real_time

shenbuguanni opened this issue · comments

First, thank you very much for your work. I have been studying the TFA module recently and have some questions about the script.
1. The shape of ZF is [1, D], whereas in the script it is [T, D].
2. time_seq is the number of frames added before the input x. Why is this done for real-time processing, how should the value of time_seq (32 in the script) be set, and what does it relate to?
Thank you again!

Hello! Thank you for your interest in our research.

Yes, that's right. The causal TFA (CTFA) used in this repo is obtained by modifying the TFA (the reference is given in the README). One major shortcoming of the original TFA is that causality is not guaranteed, since it operates per block of frames (because it is [1, D], as you mentioned). To satisfy the causality required for real-time implementation, we modified the time-axis averaging equation so that it includes only causal frames.

In this case, time_seq (to be precise, time_seq - 1) is the number of look-back frames. The reason we add the padding before the input x is to eliminate latency, and the value 32 was determined experimentally.
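To make the look-back idea concrete, here is a minimal sketch in plain Python rather than the repo's PyTorch code. It assumes the padding is zeros and uses a hypothetical helper name, `causal_time_average`; the arithmetic is that of a stride-1 average pool applied after padding time_seq - 1 frames at the front.

```python
def causal_time_average(frames, time_seq):
    """Average each frame with its time_seq - 1 preceding frames.

    frames: list of floats, one value per time frame (a hypothetical
    1-D stand-in for a per-frame feature).
    """
    if time_seq < 1:
        raise ValueError("time_seq must be >= 1")
    # Assumption: pad with zeros. Padding time_seq - 1 frames at the front
    # keeps the output the same length as the input.
    padded = [0.0] * (time_seq - 1) + list(frames)
    # Sliding window of width time_seq with stride 1. The window that
    # produces output t ends at input frame t, so no future frame is used.
    return [sum(padded[t:t + time_seq]) / time_seq for t in range(len(frames))]


x = [1.0, 2.0, 3.0, 4.0]
y = causal_time_average(x, time_seq=2)  # [0.5, 1.5, 2.5, 3.5]

# Changing the last frame leaves all earlier outputs untouched.
y_changed = causal_time_average([1.0, 2.0, 3.0, 100.0], time_seq=2)
```

With zero padding, the first time_seq - 1 outputs are averaged against padded zeros, so they are biased toward zero in this sketch.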

So time_seq is the number of history frames used by nn.AvgPool1d()? I see that the causal conv2d script does the same thing (it pads one frame at the front), so setting a smaller time_seq will not affect causality; it will only look at less history.
In addition, I would like to ask about the 1x1 conv. To my knowledge, the kernel size should be 1x1, but it is 3x1 in the script. I have read some papers that add a 1x1 conv to the skip connection (DCCRN+), and a 1x1 conv cannot change the input shape. I don't know whether you have run the corresponding experiments, but could you give me some advice?

"So time_seq is the number of history frames used by nn.AvgPool1d()?" --> Yes!
"Setting a smaller time_seq will not affect causality" --> What does this mean?

Could you please explain the question below in more detail? Sorry, I didn't understand the question.

"Setting a smaller time_seq will not affect causality" --> It means that padding any number of frames (time_seq > 0) before the input x keeps the system causal.
"1x1 conv used in skip connection" --> Is the kernel size 1x1? I think the output's channel and frequency dimensions should be the same as the input's, because it will be added to the decoder. If a 1x1 kernel is used, what is the point of this operation? Or is there another operation after the 1x1 conv, such as BN and an activation function?

First of all, I didn't use any convolutional filter for the skip connections.
Which part of the code are you referring to?

The 1x1 conv down_sample, where the kernel size is not 1x1.

down_sampling is not used as a filter for the skip connections. It is the module used for encoding. It has the same number of input and output channels and simply halves the frequency axis for down-sampling.
If this is unclear, I recommend taking a look at the Nested U-Net structure. References for it are also in the README.

And there is no particular reason why the kernel size is (3x1).
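
For anyone following the shape discussion, here is a small sketch using the standard convolution output-size formula (dilation 1). The stride (2, 1), the padding (1, 0), the input size and the helper name `conv_out_size` are assumptions for illustration, not values confirmed in this thread.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical input: 256 frequency bins and 100 time frames.
freq_bins, frames = 256, 100

# (3x1) kernel with assumed stride (2, 1) and padding (1, 0):
# the frequency axis is halved and the time axis is unchanged.
down = (conv_out_size(freq_bins, 3, stride=2, padding=1),
        conv_out_size(frames, 1))

# 1x1 kernel with stride 1: neither axis changes.
pointwise = (conv_out_size(freq_bins, 1), conv_out_size(frames, 1))

print(down)       # (128, 100)
print(pointwise)  # (256, 100)
```

In this formula it is the stride that halves the axis, not the kernel height. The kernel height only sets how many neighbouring frequency bins contribute to each output bin.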