bigpon / QPPWG

Quasi-Periodic Parallel WaveGAN Pytorch implementation

Home Page: https://bigpon.github.io/QuasiPeriodicParallelWaveGAN_demo/


Questions about implementations

xwaeaewcrhomesysplug opened this issue · comments

First of all, thank you for creating the videos and articles explaining the details.
They are very useful references.

However, after reading and searching for a while, I still cannot confirm some details.
1) Is the acoustic feature in the generator 1D or 2D?
What is it: a mel spectrogram extracted from natural speech, or a text-to-mel-spectrogram output (from another framework)?
I noticed you used notes like "conditioned on 1×F0".
From what I have seen it looks like a processed mel spectrogram, but I cannot confirm.

2) How do I calculate the pitch-dependent dilation factor?
From the video and the paper I see the explanations and the derivation of it.
It comes from the DCNN; there is an equation for it that just changes the definition of d into a variable calculated at runtime:
d' = 1 × Et. How do I calculate Et, the pitch-dependent dilation factor?
I think you mentioned it has properties related to the wave's frequency and periodicity, but I cannot visualize it.

Some questions about inherited implementation details.
I know you specifically mentioned most, if not all, of the changes you made.
However, I am uncertain about some details, so I am asking ahead of time to avoid failing.

1) The residual block: from the paper's diagram, is it also quasi-periodic, i.e. split into adaptive/fixed variants?
Is it an unmodified copy from Parallel WaveGAN (PWG)?
How does the residual block affect the generator?

2) Are the generated speech and the discriminator exactly the same as in PWG? Just to confirm.
If they are the same, then I will find PWG implementations and study them.

After all these questions you may be curious why I need to know this rather than just cloning and setting up the repo.
I decided to make the implementation self-contained and not dependent on Python libraries,
so I am porting it to Java (or JavaCPP), and I need to implement practically everything myself, except perhaps FFTW or the matrix calculations.
Any suggestions are greatly appreciated, as is the time taken to reply; I hope you have a nice day.

> 1) Is the acoustic feature in the generator 1D or 2D?
> What is it: a mel spectrogram extracted from natural speech, or a text-to-mel-spectrogram output (from another framework)?
> I noticed you used notes like "conditioned on 1×F0".
> From what I have seen it looks like a processed mel spectrogram, but I cannot confirm.

The acoustic features are extracted by WORLD (Python: https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder, C: https://github.com/mmorise/World).
WORLD encodes speech into spectral envelope (sp), pitch (F0), and aperiodicity (ap) features.
I further convert the sp into mel-cepstral coefficients with SPTK (Python: https://github.com/r9y9/pysptk).
Therefore, I can directly control the pitch using a scaled F0 (e.g. 1×F0).

> 2) How do I calculate the pitch-dependent dilation factor?
> From the video and the paper I see the explanations and the derivation of it.
> It comes from the DCNN; there is an equation for it that just changes the definition of d into a variable calculated at runtime:
> d' = 1 × Et. How do I calculate Et, the pitch-dependent dilation factor?
> I think you mentioned it has properties related to the wave's frequency and periodicity, but I cannot visualize it.

Please refer to equation (7) in https://arxiv.org/pdf/2007.05663.pdf:
Et = Fs / (F0 × a)
where 'Fs' is the sampling rate, 'F0' is the pitch of the current frame, and 'a' is a dense factor (a constant; I set it to 4 in the repo).
For the meaning of 'a', please refer to Fig. 4.
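As a quick sanity check, Et can be computed per frame like this (a small sketch; the 22050 Hz default sampling rate and the unvoiced fallback to 1 are assumptions):

```python
import numpy as np

def dilation_factor(f0, fs=22050, a=4):
    """Et = Fs / (F0 * a), rounded; unvoiced frames (F0 == 0) fall back to 1."""
    f0 = np.asarray(f0, dtype=float)
    et = np.ones_like(f0)
    voiced = f0 > 0
    et[voiced] = fs / (f0[voiced] * a)
    return np.round(et).astype(int)

# a 220.5 Hz frame at 22050 Hz with a = 4 gives 22050 / (220.5 * 4) = 25
print(dilation_factor([220.5, 0.0]).tolist())  # -> [25, 1]
```

Note how higher pitch gives a smaller dilation: the receptive field adapts to roughly one pitch period divided by the dense factor.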

> 1) The residual block: from the paper's diagram, is it also quasi-periodic, i.e. split into adaptive/fixed variants?
> Is it an unmodified copy from Parallel WaveGAN (PWG)?
> How does the residual block affect the generator?

Yes, the fixed residual block is the same as the residual block in Parallel WaveGAN.
The adaptive residual block only changes the DCNN layer to the PDCNN layer.
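To make the adaptive part concrete, here is a minimal sketch (my own illustration, not the repo's code) of how a PDCNN-style time-variant dilation can be realized: for every time step, gather the sample Et steps in the past via index arithmetic.

```python
import torch

def pitch_dependent_gather(x, f0, fs, a=4):
    """For each step t, fetch x at index t - round(Et), with Et = fs / (f0 * a);
    unvoiced steps (f0 == 0) fall back to a dilation of 1."""
    B, C, T = x.shape
    d = torch.where(f0 > 0, fs / (f0 * a), torch.ones_like(f0)).round().long()
    t = torch.arange(T)
    idx = (t.unsqueeze(0) - d).clamp(min=0)   # (B, T) past-sample indices
    idx = idx.unsqueeze(1).expand(B, C, T)    # broadcast over channels
    return torch.gather(x, 2, idx)

# toy check: fs = 8, f0 = 1 Hz, a = 4 -> Et = 2, so each step looks 2 back
x = torch.arange(5.0).view(1, 1, 5)
f0 = torch.full((1, 5), 1.0)
print(pitch_dependent_gather(x, f0, fs=8).flatten().tolist())  # -> [0.0, 0.0, 0.0, 1.0, 2.0]
```

A PDCNN layer would then apply its convolution weights over such pitch-dependent past/future taps instead of the fixed-stride taps of an ordinary dilated convolution.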

> 2) Are the generated speech and the discriminator exactly the same as in PWG? Just to confirm.
> If they are the same, then I will find PWG implementations and study them.

The discriminator is exactly the same as in Parallel WaveGAN; I only changed the generator.

If you have any questions, please feel free to ask me.

I tried re-implementing it in Java, so I did some basic syntax translation.
However, Java's static typing is rather strict, while Python has no hard static types;
the code is all "var" with hints given by the developer.
Generally the output is a tensor, and tensors come in many ranks, i.e. numbers of dimensions.
Can you help me deduce the dimensions for a few functions in the code?

The forward function in FixedResidualBlock (the fixed block): the x and s outputs.
The forward function in AdaptiveResidualBlock: the x and s outputs.

Both are outputs of the conv1x1_skip variable, which is itself the output of Conv1d1x1(x, y, bias),
and I do not know its output shape; it seems to be a class that creates a new Conv1d1x1 object.

For the ResidualParallelWaveGANDiscriminator forward function, given (b, 1, t), can I assume the tensor is 3D?

Stretch2d: again, the forward function takes a tensor whose shape I cannot deduce.

Of course there are many others, but I think the rest should be deducible from this initial information.
Any suggestions for quickly getting this information? I suppose one way is to fully set up the code and edit it to print the shapes,
but making it work will probably take some time, and I am not sure about the required setup (I need to check).

For the forward function in each class, I have provided information about the input/output tensors. For example, in the "AdaptiveBlock" class in "residual_block.py", I have provided the following:
"""Calculate forward propagation.

    Args:
        xC (Tensor): Current input tensor (B, residual_channels, T).
        xP (Tensor): Past input tensor (B, residual_channels, T).
        xF (Tensor): Future input tensor (B, residual_channels, T).
        c (Tensor): Local conditioning auxiliary tensor (B, aux_channels, T).

    Returns:
        Tensor: Output tensor for residual connection (B, residual_channels, T).
        Tensor: Output tensor for skip connection (B, skip_channels, T).

"""
(B, residual_channels, T) denotes a 3-dimensional tensor. B is batch size and T is data length.

For native PyTorch functions such as "Conv1d", I may not provide the input/output information;
it can be found in the PyTorch documentation (https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html).
However, I think it is possible to find some open-source implementations that already port PyTorch to Java.
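Regarding the Stretch2d question above: in the Parallel WaveGAN codebase, Stretch2d nearest-neighbour-upsamples a 4D (B, C, F, T) feature map, which boils down to a call like the following (the scale factors here are illustrative, not taken from any config):

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 80, 10)   # e.g. a (B, C, n_mels, frames) feature map
y = F.interpolate(x, scale_factor=(1.0, 256.0), mode="nearest")
print(tuple(y.shape))  # -> (1, 1, 80, 2560): frequency kept, time stretched
```

Stacking a few such stretches with 2D convolutions is how the auxiliary features are brought up to the waveform rate.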

Ah, I see, thanks.
So I can treat (B, residual_channels, T) as a 3D tensor, right?
In which B is an input data stream, array, or list, and T is a tensor or some other object?

I am also quite curious about your take on porting. In your view, what would be an ideal library to use for the port?

I know DJL and DL4J both have some PyTorch support,
where DJL specializes in bridging the native C library to Java.

I currently just use jblas and the Apache Commons fast math libraries for the port,
which is honestly quite limited, and I am considering reusing existing wheels instead of reinventing them to cut porting time.

Yes, (B, residual_channels, T) denotes a 3D tensor:
B -> mini-batch size.
T -> data length.
Taking the training process as an example, according to the config file (https://github.com/bigpon/QPPWG/blob/master/egs/vcc18/conf/vcc18.QPPWGaf_20.yaml), the batch size is 6 and the batch length is 25520.
Therefore, B is 6 and T is 25520 for tensors in the training stage.
"xC" will be a 6 × 64 × 25520 tensor.

I have never tried to port PyTorch to Java, so I have no experience with these libraries.
Sorry, I don't have any informative comments on them.

Thanks for your information and guidance these past few days.

I am considering using DJL for the port.
I will come back to you when I run into a bigger problem,
or maybe when I have finished porting it. ^^