kensun0 / Parallel-Wavenet

It is a tutorial, not a complete implementation.

The IAF?

zhf459 opened this issue · comments

Did you consider the IAF (Inverse Autoregressive Flow)? The paper says the student uses an IAF to generate the wave in a parallelized way.

Yes, I think it is IAF now.

@kensun0, can you explain in more detail? It seems there are no mu_t and scale_t as outputs of the original WaveNet. What does the z noise look like? I think it works like an autoencoder (in an autoregressive way), so z is just sampled from Logistic(0, 1) and has the same shape as the input x and the encoding? Thank you very much.

The original WaveNet outputs 256 softmax scores as a classification. The parallel paper says: "Since training a 65,536-way categorical distribution would be prohibitively costly, we instead modelled the samples with the discretized mixture of logistics distribution introduced in [23]." So mu_t and scale_t come from [23].
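
For reference, a minimal sketch of the discretized logistic likelihood from [23] (PixelCNN++) for a single mixture component; it assumes x is scaled to [-1, 1] with 16-bit quantization, and the bin width and clipping constant are my choices, not values from any repo here:

import numpy as np

def discretized_logistic_logprob(x, mu, log_scale, num_classes=65536):
    # Probability mass that the logistic CDF assigns to one quantization bin.
    inv_s = np.exp(-log_scale)
    half_bin = 1.0 / (num_classes - 1)
    cdf_plus = 1.0 / (1.0 + np.exp(-inv_s * (x - mu + half_bin)))  # sigmoid
    cdf_min = 1.0 / (1.0 + np.exp(-inv_s * (x - mu - half_bin)))
    return np.log(np.maximum(cdf_plus - cdf_min, 1e-12))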

If we got z in an autoregressive way, we couldn't generate the wave in parallel, right?
I think z and x have the same shape; before computing x + enc, we must upsample the encoding to the shape of x.

@kensun0 Oh, I see, so it will output the 3 parameters of the mixture of logistics distribution: pi_t, mu_t, scale_t [PixelCNN++]? I am still confused about how to generate the wave: we sample z noise and it generates the wave in parallel, but what is the output shape? Will you share your code? I can't wait to see the details.

Yes, if we use one mixture, we can remove pi_t.
Sorry, I won't share my code.
The shape of the output is the same as that of z.
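
To make the shapes concrete, here is a minimal sketch of one-flow, single-mixture IAF generation as I understand it from this thread; H and enc are placeholders for the student network and the upsampled conditioning, not real APIs:

import numpy as np

def iaf_sample(H, enc, wav_length):
    # z is drawn from Logistic(0, 1) with the same shape as x; the network
    # outputs mu and scale for every time step at once, so generation is
    # parallel rather than autoregressive.
    z = np.random.logistic(loc=0.0, scale=1.0, size=wav_length)
    mu, scale = H(z, enc)
    return z * scale + mu  # x_t = z_t * s(z_<t) + mu(z_<t)

With several flows, each flow's output becomes the next flow's input noise, and mu_tot/scale_tot accumulate as in the pseudocode later in this thread.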

@kensun0 very nice of you, thank you!

@zhf459 My understanding is that when you use the logistic mixture model, at the end of the first flow you sample a wave-like result as the input of the next flow, and so on, until at the last flow you get a better sampled wave; but when you use a categorical distribution, we just need one flow at the end to make the loss between the teacher and the student drop?
I don't know if I understand it right. The IAF source code from OpenAI seems difficult for me to understand. Will we have to use all of the source code of the original IAF? There is too much code. Maybe we can work together to complete it.

@kensun0 Hi, since the paper says the student WaveNet doesn't have skip-connection layers, what is the last layer? And there are 4 IAF layers with size = [10, 10, 10, 30]; is each IAF layer a simplified WaveNet?

The last layer outputs the parameters of the logistic distribution; its shape is [wav_length, channels].
If you use one mixture, channels = 2: they are mu_tot and scale_tot.
Yes, each IAF is a WaveNet.

@kensun0, I use the original last layer with a one-mixture output in the student while using a 10-mixture logistic in the teacher; is that OK? How is your final result? Can you upload some samples?

That is OK. I also do that.

OK, I will try again.

@kensun0 Hi, how do you calculate the power loss? I use the following code but get a very large loss; how can I fix this:

import numpy as np
import librosa

def get_power_loss(sample_, x_):
    # Mean over the batch of the squared difference between the
    # STFT power spectra of the generated and ground-truth audio.
    batch = sample_.shape[0]
    s = 0.0
    for i in range(batch):
        ss = np.abs(librosa.stft(sample_[i][0])) ** 2 - np.abs(librosa.stft(x_[i][0])) ** 2
        s += np.sum(ss ** 2)
    return s / batch

@zhf459 I have tested the power_loss and it works right, but I do not know how to implement the cross-entropy loss. Have you tried it?

@jiqizaisikao What do you mean by "it works right"? Did it work in parallel WaveNet? I tried some ways to calculate the KL loss, but I have no idea whether it works or not.

# TensorFlow version: compare |STFT|^2 power spectra.
# `sample` is assumed to be the power spectrum of the generated audio,
# computed with the same transform as below.
wav = tf.contrib.signal.stft(wav, frame_length=512, frame_step=256, fft_length=512)
wav = tf.real(wav * tf.conj(wav))  # |STFT|^2
# wav = tf.log(wav)
diff = sample - wav
loss_power = tf.reduce_mean(tf.reduce_mean(tf.square(diff), 0))
# loss_power = tf.log(loss_power)

@zhf459 Maybe you can publish your code; I will check it or follow it.

@kensun0 Yes, please help me to make it work! Thank you~ Check this: https://github.com/zhf459/P_wavenet_vocoder

@zhf459 I am so sorry that I have no time to read PyTorch code. :-(
If you follow Google's implementation, https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth , I can follow you easily.

Have you got any good-quality wav? My result is not ideal now.

Yes, I got a normal wav, but it is worse than the original WaveNet.

My result is also normal, but worse than WORLD... lol

@kensun0, could you share some of your examples?

Also, is the repo on your GitHub the final code of your parallel WaveNet?

I don't quite understand how to compute H(Ps) and H(Ps, Pt). How can the expectation be computed by Monte Carlo sampling?
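
Not an authoritative answer, but here is how I read Equations (9)-(13): the entropy of Logistic(mu, s) is ln s + 2 in closed form, so only the expectation over z needs Monte Carlo samples. A minimal sketch, where student_scales is a hypothetical function returning the per-step scales for one noise draw:

import numpy as np

def entropy_H_Ps(student_scales, T, num_samples=8):
    # H(Ps) = E_{z ~ Logistic(0,1)}[ sum_t ln s(z_<t, theta) ] + 2T,
    # with the expectation estimated by averaging over noise draws.
    total = 0.0
    for _ in range(num_samples):
        z = np.random.logistic(size=T)
        s = student_scales(z)  # shape (T,)
        total += np.sum(np.log(s))
    return total / num_samples + 2.0 * T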

I am not sure if your pseudocode for the student network is correct:

mu_tot, scale_tot = 0, 1
for f in flow:
    new_z = shiftright(z)
    for i in range(layers - 1):
        new_z = H_i(new_z, θs_i)
        new_z += new_enc
    mu_s_f, scale_s_f = H_last(new_z, θs_last)    # last layer
    mu_tot = mu_s_f + mu_tot * scale_s_f
    scale_tot = scale_tot * scale_s_f
    z = z * scale_s_f + mu_s_f

I think new_z = shiftright(z) is not necessary.

https://github.com/bfs18/nsynth_wavenet
I implemented a minimal demo of parallel WaveNet based on nsynth.
Not finished tuning yet.

@bfs18 Did you get any good samples?

@weixsong Sorry, I cannot do this; I used commercial datasets.

@zhang-jian x_t = z_t * s(z_<t, theta) + u(z_<t, theta); we cannot use z_t to infer s(z_<t, theta) and u(z_<t, theta). Is that right?

Hi @bfs18,
I think your kl_loss implementation is non-differentiable at https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L173 .
Are there any samples?

Hi @xuerq
There is no analytic solution for the KL divergence in the paper, so sampling is needed. All the operations used to calculate the sampled version of the KL divergence are differentiable, so gradient methods can be used. I think the KL calculation in the code follows the paper.
https://github.com/bfs18/nsynth_wavenet/blob/6ef0e140899071bf04310af1bb9d9d99d8c7e747/wavenet/parallel_wavenet.py#L167
x_xp are the samples of x_i given x_<i.
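
For intuition, a minimal sketch of the sampled cross-entropy term (my own illustration, not bfs18's code; student and teacher_params are hypothetical networks): draw x from the student via the IAF, then score it under the teacher's per-step logistic. Every operation below is differentiable with respect to the student parameters:

import tensorflow as tf

def sampled_cross_entropy(z, student, teacher_params):
    # H(Ps, Pt) depends on x = g(z), so we sample x from the student
    # and evaluate the teacher's logistic log-density at those samples.
    mu_s, scale_s = student(z)
    x = z * scale_s + mu_s               # one IAF flow: x_t = z_t * s_t + mu_t
    mu_t, scale_t = teacher_params(x)
    u = (x - mu_t) / scale_t
    log_pt = -u - tf.log(scale_t) - 2.0 * tf.nn.softplus(-u)  # logistic log-pdf
    return -tf.reduce_sum(log_pt)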

@xuerq
I don't know the implementation details, but these two operations are differentiable in tf-1.8. I just tested this: https://github.com/bfs18/nsynth_wavenet/blob/16f34e1f79985b023ac3538621a3169801414b87/tests/test_parallel_wavenet.py#L46
You can test it yourself.

@bfs18 Good job~

@kensun0
My bad, I understand now.
From https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L70 to https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L103, mean and scale are for time steps < t. We then get new_x from the original x, mean, and scale at https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L105 (also Equation 2 in the parallel WaveNet paper).

Therefore, the output of the student network is for time steps [0..t], e.g. https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L167

Since the same shift is also used in the teacher network, the time steps of the teacher and student networks line up, both [0..t].
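
As an aside, the shift in question is usually just a pad-and-crop; a minimal TF sketch of what I assume is meant here:

import tensorflow as tf

def shift_right(x):
    # Prepend a zero frame and drop the last one, so position t
    # only sees inputs up to t - 1 (x has shape [batch, time, channels]).
    return tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]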

@bfs18 My TF is 1.5; the tf.cast and tf.floor ops are not differentiable there:
"ValueError: No gradients provided for any variable"
I will try 1.8 later. It perplexed me for a long time, thank you!

Hi,
The parallel paper says that "The cross-entropy term H(PS, PT) however explicitly depends on x = g(z), and therefore requires sampling from the student to estimate." just before Equations (9)-(13). So I think that https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L175 should use x_xp_scaled rather than x_scaled.

I got some positive results. The generated wave is noisy but intelligible.
https://github.com/bfs18/nsynth_wavenet/blob/master/tests/pred_data-pwn-failed_cases/gen_LJ001-0001-stft_abs.wav

@bfs18
Some advice:

  1. Compare (np.random.logistic(0.0, np.exp(-7)) * 32768) with (np.random.logistic(0.0, np.exp(-7)) * 128); maybe changing -7 to -12 would be better.
  2. In masked.py, we do not pad the input wave, so the output around wav[0:receptive_field] is wrong, while wav[receptive_field:wav.length] is right. Maybe you should extend the input length first, then cut the output back to wav.length - receptive_field.

Hi @kensun0
Thanks for your advice.

  1. Sounds plausible, since I don't know where the magic lower bound -7.0 in the PixelCNN code comes from.
  2. Honestly, I implemented the extension you mentioned in a previous version. It is a bit more complicated than the zero-padding in nsynth, so I took the latter method. When predicting a new wave, we always start from zeros; if zero-padding is used, we just treat every crop as a full-sentence sample.

@bfs18
2. It is not about zero-padding: P(x_i | x_{i-receptive_field}, ..., x_{i-1}). To get one predicted point, we need to know the ground-truth points before it. If you cut the original wave into pieces for training, be careful: the ground-truth points are not always zeros (padding).
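
In other words (my reading, not kensun0's code): when cropping training segments, keep the preceding receptive_field samples as real context instead of relying on zero-padding. A minimal sketch:

import numpy as np

def crop_with_context(wav, start, seg_len, receptive_field):
    # Keep real samples before `start` as context, so the points being
    # predicted see ground truth; zeros are used only at the very
    # beginning of the recording, where no context exists.
    ctx_start = max(start - receptive_field, 0)
    pad = receptive_field - (start - ctx_start)
    segment = wav[ctx_start:start + seg_len]
    return np.pad(segment, (pad, 0), mode='constant')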

@kensun0 What is the motivation for doing this?

compare (np.random.logistic(0.0, np.exp(-7)) * 32768) with (np.random.logistic(0.0, np.exp(-7)) * 128); maybe changing -7 to -12 would be better.

@zhang-jian Do the comparison: sample from the two distributions many times, and think about how to generate a wave or an image.
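
My reading of the point, as an illustration: the same scale floor exp(-7) is negligible on an 8-bit amplitude scale (*128) but spans dozens of quantization steps on a 16-bit scale (*32768), so the lower bound should shrink when the output range grows. A quick experiment:

import numpy as np

# Noise floor exp(-7), scaled to 8-bit vs. 16-bit amplitude ranges.
noise = np.random.logistic(0.0, np.exp(-7), size=100000)
print(np.std(noise * 128))    # ~0.2 quantization steps: inaudible at 8 bits
print(np.std(noise * 32768))  # ~54 quantization steps: audible hiss at 16 bits
# With exp(-12), the 16-bit case drops to ~0.4 steps.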