kensun0 / Parallel-Wavenet

It is a tutorial, not a complete implementation.

The IAF?

zhf459 opened this issue · comments

Did you consider the IAF (Inverse Autoregressive Flow)? The paper says the student uses an IAF to generate the wave in a parallelized way.

Yes, I think it is IAF now.

@kensun0, can you explain in more detail? It seems there are no mu_t and scale_t as outputs of the original WaveNet. What does the z noise look like? I think it works like an autoencoder (in an autoregressive way), so z is just sampled from Logistic(0, 1) and has the same shape as the input x and the encoding? Thank you very much.

The original WaveNet outputs 256 softmax scores as a classification. The parallel paper says: "Since training a 65,536-way categorical distribution would be prohibitively costly, we instead modelled the samples with the discretized mixture of logistics distribution introduced in [23]." So mu_t and scale_t come from [23].
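
For reference, a minimal sketch of the discretized logistic likelihood from [23] (PixelCNN++) for a single mixture component; it assumes x is scaled to [-1, 1] with 16-bit quantization, and the bin width and clipping constant are my choices, not values from any repo here:

import numpy as np

def discretized_logistic_logprob(x, mu, log_scale, num_classes=65536):
    # Probability mass that the logistic CDF assigns to one quantization bin.
    inv_s = np.exp(-log_scale)
    half_bin = 1.0 / (num_classes - 1)
    cdf_plus = 1.0 / (1.0 + np.exp(-inv_s * (x - mu + half_bin)))  # sigmoid
    cdf_min = 1.0 / (1.0 + np.exp(-inv_s * (x - mu - half_bin)))
    return np.log(np.maximum(cdf_plus - cdf_min, 1e-12))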

If we got z in an autoregressive way, we couldn't generate the wave in parallel, right?
I think z and x have the same shape; before computing x + enc, we must upsample the encoding to the shape of x.

@kensun0 Oh, I see, so it will output the 3 parameters of the mixture of logistics distribution: pi_t, mu_t, scale_t [PixelCNN++]? I am still confused about how to generate the wave: we sample z noise and it generates the wave in parallel, but what is the output shape? Will you share your code? I can't wait to see the details.

Yes, if we use one mixture, we can remove pi_t.
Sorry, I won't share my code.
The shape of the output is the same as that of z.
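
To make the shapes concrete, here is a minimal sketch of one-flow, single-mixture IAF generation as I understand it from this thread; H and enc are placeholders for the student network and the upsampled conditioning, not real APIs:

import numpy as np

def iaf_sample(H, enc, wav_length):
    # z is drawn from Logistic(0, 1) with the same shape as x; the network
    # outputs mu and scale for every time step at once, so generation is
    # parallel rather than autoregressive.
    z = np.random.logistic(loc=0.0, scale=1.0, size=wav_length)
    mu, scale = H(z, enc)
    return z * scale + mu  # x_t = z_t * s(z_<t) + mu(z_<t)

With several flows, each flow's output becomes the next flow's input noise, and mu_tot/scale_tot accumulate as in the pseudocode later in this thread.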

@kensun0 very nice of you, thank you!

@zhf459 My understanding is that when you use the logistic mixture model, at the end of the first flow you sample a wave-like result as the input of the next flow, and so on, until at the last flow you get a better sampled wave; but when you use a categorical distribution, we just need one flow at the end to make the loss between the teacher and the student drop?
I don't know if I understand it right. The IAF source code from OpenAI seems difficult for me to understand. Will we have to use all of the source code of the original IAF? There is too much code. Maybe we can work together to complete it.

@kensun0 Hi, since the paper says the student WaveNet doesn't have skip-connection layers, what is the last layer? And there are 4 IAF layers with size = [10, 10, 10, 30]; is each IAF layer a simplified WaveNet?

The last layer outputs the parameters of the logistic distribution; its shape is [wav_length, channels].
If you use one mixture, channels = 2: they are mu_tot and scale_tot.
Yes, each IAF is a WaveNet.

@kensun0, I use the original last layer with a one-mixture output in the student while using a 10-mixture logistic in the teacher; is that OK? How is your final result? Can you upload some samples?

That is OK. I also do that.

OK, I will try again.

@kensun0 Hi, how do you calculate the power loss? I use the following code but get a very large loss; how can I fix this:

import numpy as np
import librosa

def get_power_loss(sample_, x_):
    # Mean over the batch of the squared difference between the
    # STFT power spectra of the generated and ground-truth audio.
    batch = sample_.shape[0]
    s = 0.0
    for i in range(batch):
        ss = np.abs(librosa.stft(sample_[i][0])) ** 2 - np.abs(librosa.stft(x_[i][0])) ** 2
        s += np.sum(ss ** 2)
    return s / batch

@zhf459 I have tested the power_loss and it works right, but I do not know how to implement the cross-entropy loss. Have you tried it?

@jiqizaisikao What do you mean by "it works right"? Did it work in parallel WaveNet? I tried some ways to calculate the KL loss, but I have no idea whether it works or not.

# TensorFlow version: compare |STFT|^2 power spectra.
# `sample` is assumed to be the power spectrum of the generated audio,
# computed with the same transform as below.
wav = tf.contrib.signal.stft(wav, frame_length=512, frame_step=256, fft_length=512)
wav = tf.real(wav * tf.conj(wav))  # |STFT|^2
# wav = tf.log(wav)
diff = sample - wav
loss_power = tf.reduce_mean(tf.reduce_mean(tf.square(diff), 0))
# loss_power = tf.log(loss_power)

@zhf459 Maybe you can publish your code; I will check it or follow it.

@kensun0 Yes, please help me to make it work! Thank you~ Check this: https://github.com/zhf459/P_wavenet_vocoder

@zhf459 I am so sorry that I have no time to read PyTorch code. :-(
If you follow Google's implementation, https://github.com/tensorflow/magenta/tree/master/magenta/models/nsynth , I can follow you easily.

Have you got any good-quality wav? My result is not ideal now.

Yes, I got a normal wav, but it is worse than the original WaveNet.

My result is also normal, but worse than WORLD... lol

@kensun0, could you share some of your examples?

Also, is the repo on your GitHub the final code of your parallel WaveNet?

I don't quite understand how to compute H(Ps) and H(Ps, Pt). How can the expectation be computed by Monte Carlo sampling?
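
Not an authoritative answer, but here is how I read Equations (9)-(13): the entropy of Logistic(mu, s) is ln s + 2 in closed form, so only the expectation over z needs Monte Carlo samples. A minimal sketch, where student_scales is a hypothetical function returning the per-step scales for one noise draw:

import numpy as np

def entropy_H_Ps(student_scales, T, num_samples=8):
    # H(Ps) = E_{z ~ Logistic(0,1)}[ sum_t ln s(z_<t, theta) ] + 2T,
    # with the expectation estimated by averaging over noise draws.
    total = 0.0
    for _ in range(num_samples):
        z = np.random.logistic(size=T)
        s = student_scales(z)  # shape (T,)
        total += np.sum(np.log(s))
    return total / num_samples + 2.0 * T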

I am not sure if your pseudocode for the student network is correct:

mu_tot, scale_tot = 0, 1
for f in flow:
    new_z = shiftright(z)
    for i in range(layers - 1):
        new_z = H_i(new_z, θs_i)
        new_z += new_enc
    mu_s_f, scale_s_f = H_last(new_z, θs_last)    # last layer
    mu_tot = mu_s_f + mu_tot * scale_s_f
    scale_tot = scale_tot * scale_s_f
    z = z * scale_s_f + mu_s_f

I think new_z = shiftright(z) is not necessary.

https://github.com/bfs18/nsynth_wavenet
I implemented a minimal demo of parallel WaveNet based on nsynth.
Not finished tuning yet.

@bfs18 Did you get any good samples?

@weixsong Sorry, I cannot do this; I used commercial datasets.

@zhang-jian x_t = z_t * s(z_<t, theta) + u(z_<t, theta); we cannot use z_t to infer s(z_<t, theta) and u(z_<t, theta). Is that right?

Hi @bfs18,
I think your kl_loss implementation is non-differentiable at https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L173 .
Are there any samples?

Hi @xuerq
There is no analytic solution for the KL divergence in the paper, so sampling is needed. All the operations used to calculate the sampled version of the KL divergence are differentiable, so gradient methods can be used. I think the KL calculation in the code follows the paper.
https://github.com/bfs18/nsynth_wavenet/blob/6ef0e140899071bf04310af1bb9d9d99d8c7e747/wavenet/parallel_wavenet.py#L167
x_xp are the samples of x_i given x_<i.
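
For intuition, a minimal sketch of the sampled cross-entropy term (my own illustration, not bfs18's code; student and teacher_params are hypothetical networks): draw x from the student via the IAF, then score it under the teacher's per-step logistic. Every operation below is differentiable with respect to the student parameters:

import tensorflow as tf

def sampled_cross_entropy(z, student, teacher_params):
    # H(Ps, Pt) depends on x = g(z), so we sample x from the student
    # and evaluate the teacher's logistic log-density at those samples.
    mu_s, scale_s = student(z)
    x = z * scale_s + mu_s               # one IAF flow: x_t = z_t * s_t + mu_t
    mu_t, scale_t = teacher_params(x)
    u = (x - mu_t) / scale_t
    log_pt = -u - tf.log(scale_t) - 2.0 * tf.nn.softplus(-u)  # logistic log-pdf
    return -tf.reduce_sum(log_pt)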

@xuerq
I don't know the implementation details, but these two operations are differentiable in tf-1.8. I just tested this: https://github.com/bfs18/nsynth_wavenet/blob/16f34e1f79985b023ac3538621a3169801414b87/tests/test_parallel_wavenet.py#L46
You can test it yourself.

@bfs18 Good job~

@kensun0
My bad, I understand now.
From https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L70 to https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L103, mean and scale are for time steps < t. We then get new_x from the original x, mean, and scale at https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L105 (also Equation 2 in the parallel WaveNet paper).

Therefore, the output of the student network is for time steps [0..t], e.g. https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L167

Since the same shift is also used in the teacher network, the time steps of the teacher and student networks line up, both [0..t].
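
As an aside, the shift in question is usually just a pad-and-crop; a minimal TF sketch of what I assume is meant here:

import tensorflow as tf

def shift_right(x):
    # Prepend a zero frame and drop the last one, so position t
    # only sees inputs up to t - 1 (x has shape [batch, time, channels]).
    return tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]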

@bfs18 My TF is 1.5; the tf.cast and tf.floor ops are not differentiable there:
"ValueError: No gradients provided for any variable"
I will try 1.8 later. It perplexed me for a long time, thank you!

Hi,
The parallel paper says that "The cross-entropy term H(PS, PT) however explicitly depends on x = g(z), and therefore requires sampling from the student to estimate." just before Equations (9)-(13). So I think that https://github.com/bfs18/nsynth_wavenet/blob/master/wavenet/parallel_wavenet.py#L175 should use x_xp_scaled rather than x_scaled.

I got some positive results. The generated wave is noisy but intelligible.
https://github.com/bfs18/nsynth_wavenet/blob/master/tests/pred_data-pwn-failed_cases/gen_LJ001-0001-stft_abs.wav

@bfs18
Some advice:

  1. Compare (np.random.logistic(0.0, np.exp(-7)) * 32768) with (np.random.logistic(0.0, np.exp(-7)) * 128); maybe changing -7 to -12 would be better.
  2. In masked.py, we do not pad the input wave, so the output around wav[0:receptive_field] is wrong, while wav[receptive_field:wav.length] is right. Maybe you should extend the input length first, then cut the output back to wav.length - receptive_field.

Hi @kensun0
Thanks for your advice.

  1. Sounds plausible, since I don't know where the magic lower bound -7.0 in the PixelCNN code comes from.
  2. Honestly, I implemented the extension you mentioned in a previous version. It is a bit more complicated than the zero-padding in nsynth, so I took the latter method. When predicting a new wave, we always start from zeros; if zero-padding is used, we just treat every crop as a full-sentence sample.

@bfs18
2. It is not about zero-padding: P(x_i | x_{i-receptive_field}, ..., x_{i-1}). To get one predicted point, we need to know the ground-truth points before it. If you cut the original wave into pieces for training, be careful: the ground-truth points are not always zeros (padding).
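
In other words (my reading, not kensun0's code): when cropping training segments, keep the preceding receptive_field samples as real context instead of relying on zero-padding. A minimal sketch:

import numpy as np

def crop_with_context(wav, start, seg_len, receptive_field):
    # Keep real samples before `start` as context, so the points being
    # predicted see ground truth; zeros are used only at the very
    # beginning of the recording, where no context exists.
    ctx_start = max(start - receptive_field, 0)
    pad = receptive_field - (start - ctx_start)
    segment = wav[ctx_start:start + seg_len]
    return np.pad(segment, (pad, 0), mode='constant')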

@kensun0 What is the motivation for doing this?

compare (np.random.logistic(0.0, np.exp(-7)) * 32768) with (np.random.logistic(0.0, np.exp(-7)) * 128); maybe changing -7 to -12 would be better.

@zhang-jian Do the comparison: sample from the two distributions many times, and think about how to generate a wave or an image.
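
My reading of the point, as an illustration: the same scale floor exp(-7) is negligible on an 8-bit amplitude scale (*128) but spans dozens of quantization steps on a 16-bit scale (*32768), so the lower bound should shrink when the output range grows. A quick experiment:

import numpy as np

# Noise floor exp(-7), scaled to 8-bit vs. 16-bit amplitude ranges.
noise = np.random.logistic(0.0, np.exp(-7), size=100000)
print(np.std(noise * 128))    # ~0.2 quantization steps: inaudible at 8 bits
print(np.std(noise * 32768))  # ~54 quantization steps: audible hiss at 16 bits
# With exp(-12), the 16-bit case drops to ~0.4 steps.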