kan-bayashi / PytorchWaveNetVocoder

WaveNet-Vocoder implementation with pytorch.

Home Page: https://kan-bayashi.github.io/WaveNetVocoderSamples/

Some questions

julianzaidi opened this issue

Hi, I have some questions concerning your code:

1 - In the train_generator() function of the train.py script (lines 69 to 269), you create your batches using the buffers x_buffer and h_buffer. You initialize them at the beginning of the code and then fill them with new audio / feature data. My question refers to lines 135-136 and 178-179:

x_buffer = np.concatenate([x_buffer, x], axis=0)
h_buffer = np.concatenate([h_buffer, h], axis=0)

Initially, x_buffer and h_buffer are empty. However, while iterating over files, it is possible that x_buffer and h_buffer still contain data from the previous wav file. In this case, you concatenate data from two different audio files, which could affect training quality. Is this intentional?

2 - In the same script, at lines 472-474, your loss does not seem to take the receptive field of the WaveNet into account. This may be because your shift size when creating batches is equal to batch_length rather than batch_length + receptive_field, but I wanted to be sure about this choice concerning the loss calculation.
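For readers following along, the point in question 2 can be made concrete with a small sketch of a loss that excludes the receptive-field warm-up region. This is only an illustration, not the repository's actual code; the tensor names, shapes, and values are assumptions:

import torch
import torch.nn.functional as F

# illustrative values only; the real ones depend on the network configuration
batch_size, n_quantize = 2, 256
receptive_field, batch_length = 1024, 20000

# assume the network outputs logits of shape (B, n_quantize, T) for inputs of
# length T = receptive_field + batch_length, with targets of shape (B, T)
logits = torch.randn(batch_size, n_quantize, receptive_field + batch_length)
targets = torch.randint(0, n_quantize, (batch_size, receptive_field + batch_length))

# only the samples after the receptive-field warm-up contribute to the loss
loss = F.cross_entropy(logits[:, :, receptive_field:], targets[:, receptive_field:])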

3 - Finally, this last question is an open one. Have you thought about training WaveNet on mel-spectrograms, as in the Tacotron 2 paper? Apparently, training on mel-spectrograms allows better audio quality during synthesis. This may also require changing your loss to a mixture of logistic distributions (MoL), as in the WaveNet 2 paper.

I hope this post does not bother you!

Cheers,

Julian

Hi Mr. Julian.
Thank you for your questions.

  1. Yes, I am aware of the problem, but I think that in the case of a speaker-dependent model it is not a serious one. Also, each utterance has silence at the beginning and the end, so I think it does not affect the next utterance very much even if different utterances are concatenated. If you are worried about it, you can easily change the code to initialize the buffer for every utterance, or use utterance-level batches (see the sketch after this list).

  2. Yes. We use batch_length + receptive_field samples as a batch so that the receptive field part is not taken into account in the loss calculation.

  3. We are now trying to use the STFT spectrogram or mel-spectrogram directly. It seems to work nicely. (We do not have to care about F0 analysis errors.) I will brush up my code so that the feature type can be selected. I have also tried MoL loss training, but it is a little more difficult to train compared with the cross-entropy loss (especially with a small amount of data). If I find a good training setting, I will integrate it into this repository :)
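To make the per-utterance alternative mentioned in point 1 concrete, here is a rough sketch of a generator that never mixes samples from different files. It is an illustration only, not the code in train.py; the data layout and the handling of the utterance tail are simplified assumptions:

import numpy as np

def utterance_batch_generator(utterances, batch_length, receptive_field):
    """Yield (x, h) chunks without mixing samples from different utterances.

    `utterances` is an iterable of (x, h) pairs, where x is the quantized
    waveform of shape (T,) and h the sample-aligned auxiliary features (T, D).
    """
    for x, h in utterances:
        # buffers start fresh for every utterance, so no cross-file concatenation
        x_buffer, h_buffer = x, h
        while len(x_buffer) > receptive_field + batch_length:
            yield (x_buffer[:receptive_field + batch_length],
                   h_buffer[:receptive_field + batch_length])
            # shift by batch_length so consecutive chunks keep receptive-field context
            x_buffer = x_buffer[batch_length:]
            h_buffer = h_buffer[batch_length:]
        # the remaining tail is dropped in this sketch; it could also be padded

# toy usage with random data
utts = [(np.random.randint(0, 256, size=30000), np.random.randn(30000, 28))
        for _ in range(3)]
for x_chunk, h_chunk in utterance_batch_generator(utts, batch_length=8000,
                                                  receptive_field=1024):
    pass  # feed each chunk to the network here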

Hello,
These are very good questions.

  1. Finally, someone else has also noticed this slight but, I think, quite important issue.
    So, I trained a model using the current buffering procedure. Then, using the trained model, I measured the loss without optimizing the parameters, but with revised buffering, i.e., the buffer was initialized for each utterance (still using mini-batch samples).
    I found that the loss increases slightly, by about 0.01~0.02, compared to the loss observed during training. It is also difficult to reproduce the same training loss with the current buffering procedure, because the initial starting samples shift as the training epochs go by, unless you track that sample shift up to the number of training epochs you want to use.

  2. If you use the current buffering, the first receptive_field samples in the first mini-batch will not be included in the optimization. The solution is to left-pad the first mini-batch with zeros before feeding it into the network (a sketch is given after this list).
    If you change the code to initialize the buffer for every utterance, the first receptive_field samples of each utterance will not be included. So, in that case, the solution is to left-pad the first mini-batch of each utterance with zeros.
    Note that if you initialize the buffer for every utterance, you also need to handle the last remaining mini-batch samples of each utterance.

  3. I have tried training using mel-spectrograms. The loss is much lower compared to the current auxiliary feature set. The sounds are nice, but sometimes there are some background-sound artifacts that can be perceived. I will try to post some samples.
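A rough sketch of the left padding described in point 2, assuming numpy arrays for the waveform and the sample-aligned auxiliary features (an illustration, not code from the repository):

import numpy as np

def pad_first_chunk(x, h, receptive_field):
    """Left-pad the first chunk of an utterance with zeros.

    x: quantized waveform of shape (T,); h: auxiliary features of shape (T, D).
    With the padding, the first receptive_field inputs are zeros, so the true
    first samples of the utterance are predicted and contribute to the loss.
    """
    x_padded = np.concatenate([np.zeros(receptive_field, dtype=x.dtype), x])
    h_padded = np.concatenate([np.zeros((receptive_field, h.shape[1]),
                                        dtype=h.dtype), h])
    return x_padded, h_padded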

Thanks @kan-bayashi and @patrickltobing for your answers.

1 - As @kan-bayashi said, it may be mitigated by the fact that the audio contains silence at the beginning and the end, so the incoherence caused by concatenation is smoothed out. Since the audio quality is very nice with this training procedure, changing the code to be perfectly consistent may not bring a big improvement to the generated audio.

2 - @patrickltobing has a good point concerning the starting receptive_field samples that are not taken into account in the loss. I will try to see the impact of left padding.

3 - It's true that the MoL loss should only be used when a lot of examples are available in the training set. For the moment, training only on the CMU ARCTIC data is not sufficient, but building a bigger database may solve the problem.

Looking forward to the next commit concerning mel-spectrogram training :)

Hi @julianzaidi.
I implemented a mel-spectrogram recipe.
If you are interested, please try it.
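For anyone curious what mel-spectrogram auxiliary features typically look like, here is a rough sketch using librosa. The parameter values and the log compression are illustrative assumptions, not necessarily what the recipe uses:

import librosa
import numpy as np

def logmelspectrogram(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Return a log mel-spectrogram of shape (n_frames, n_mels)."""
    x, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T

# to condition WaveNet, each frame is then repeated / upsampled by hop_length
# so that the features are aligned with the waveform samples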

Thanks for the feedback @kan-bayashi! I have launched a training run and the code seems to work properly. The update to the new PyTorch version is really nice.

I have just one question concerning the pre-processing part of your code. You don't seem to use any noise shaping technique on the audio. You marked a TODO in the corresponding section, but why don't you use the same noise shaping technique as the one used with the WORLD features?

@julianzaidi Yes, the same noise shaping technique can be applied. But the current implementation uses an MLSA filter, which builds the filter from the mel-cepstrum, so the filter design part needs to be modified. I'm now trying to implement it.

@kan-bayashi Hi, the results are very good! But generation is slow! Do you plan to implement Parallel WaveNet? Thanks!

@kan-bayashi that may be better and more coherent indeed :)

@maozhiqiang Parallel WaveNet is a solution, but it requires a (very) good teacher WaveNet and a lot of computation / memory for training. Take a look at this NVIDIA GitHub repo; it is for the "standard" WaveNet and reaches real-time generation.

@julianzaidi I implemented noise shaping with STFT-based mcep in the mel-spectrogram recipe. Now all of the recipes can use the noise shaping technique :)

@maozhiqiang As @julianzaidi said, it requires a lot of resources, but it is worth trying. I want to implement it.

Thanks a lot @kan-bayashi for all your work :) I will try this !

Thanks a lot @kan-bayashi