soroushmehr / sampleRNN_ICLR2017

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Home Page: https://arxiv.org/abs/1612.07837

To continue test wav?

lukezos opened this issue

First: your work is absolutely awesome! The fact that the model is capable of generating a reasonably-sounding signal for many seconds (and hundreds of thousands of samples) is awesome.

Second: my question:
I have trained models on a music dataset successfully. I would like to see (hear) a continuation of a given wav file, generated by the model. Basically, I want to find out how well and for how long the model is able to continue the input sound.
Please give me a hint on how to do that,
thanks!
Lukas

Thank you! However, I think this will just generate a longer sequence initialised by (from two_tier.py):

# First half zero, others fixed random at each checkpoint

h0 = numpy.zeros(
        (N_SEQS-fixed_rand_h0.shape[0], N_RNN, H0_MULT*DIM),
        dtype='float32'
)
h0 = numpy.concatenate((h0, fixed_rand_h0), axis=0)

My point is to continue a "real" sequence instead (i.e. one taken from the test/validation/train npy file) and see how well the model is able to continue the current note, tempo, etc. (for the music dataset).
Should I just replace the above initialisation with feeding in a "real" sequence?

thanks,
Lukas

Basically, you'll use your audio to compute the hidden states of the RNN, and then you'll use them as the initial hidden state when you start generating.

This would amount to inserting a loop like https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/models/three_tier/three_tier.py#L685 before this loop, except that samples will already contain the audio that you have (with all preprocessing) and you'll drop this line (https://github.com/soroushmehr/sampleRNN_ICLR2017/blob/master/models/three_tier/three_tier.py#L702), i.e. you'll not be updating your seeded audio but only collecting the updated hidden states.

Then, you can use the generation loop with the new hidden states to generate audio.
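
For illustration, here is a minimal sketch of that warm-up idea. It is not the repo's actual code: generate_step below stands in for the compiled Theano generation function in three_tier.py, and the frame size, quantisation levels, and hidden-state shape are assumptions.

import numpy

FRAME_SIZE = 16   # samples per frame (assumed default)
Q_LEVELS = 256    # 8-bit quantisation (assumed)

def generate_step(frame, h0):
    # Stand-in for the compiled frame-level generation function in three_tier.py:
    # it consumes one frame and returns next-sample probabilities plus an updated
    # hidden state. A dummy uniform distribution keeps this sketch self-contained.
    probs = numpy.full(Q_LEVELS, 1.0 / Q_LEVELS)
    return probs, h0

# Preprocessed (quantised) seed audio; in a real run take this from your
# test/validation npy data instead of random values.
seed = numpy.random.randint(0, Q_LEVELS, size=16000).astype('int32')
n_seed = len(seed) - len(seed) % FRAME_SIZE

h0 = numpy.zeros((1, 1, 512), dtype='float32')  # initial hidden state (shape assumed)

# Warm-up: feed the seed frame by frame to update h0, but never write sampled
# values back over the seed (i.e. drop the update line at L702).
for t in range(0, n_seed, FRAME_SIZE):
    _, h0 = generate_step(seed[t:t + FRAME_SIZE], h0)

# h0 now encodes the seed; use it as the initial hidden state of the normal
# generation loop and generate the continuation from there.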

Alternatively, you can concatenate your audio before the zeros in the samples array, but when running the generation loop you will not update the samples array for the timesteps that correspond to the seeded audio.
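
A rough sketch of this alternative, continuing from the sketch above (generate_step, seed, n_seed, FRAME_SIZE and Q_LEVELS are the same assumed placeholders, not the repo's actual names):

n_generate = 4 * 16000                       # continue for ~4 s at 16 kHz (assumption)
samples = numpy.zeros(n_seed + n_generate, dtype='int32')
samples[:n_seed] = seed                      # seeded audio sits in front of the zeros

h0 = numpy.zeros((1, 1, 512), dtype='float32')
for t in range(FRAME_SIZE, n_seed + n_generate, FRAME_SIZE):
    probs, h0 = generate_step(samples[t - FRAME_SIZE:t], h0)
    if t >= n_seed:
        # Only timesteps past the seed get overwritten with sampled values.
        # (Drawing a whole frame from one distribution is a simplification;
        # the real loop samples one value at a time.)
        samples[t:t + FRAME_SIZE] = numpy.random.choice(
            Q_LEVELS, size=FRAME_SIZE, p=probs)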

Hi!

Thank you!
I went with the second option: concatenating my audio before the zeros in the samples array, and not updating the samples array for the timesteps that correspond to the seeded audio.

For best results, what should the length of the seeded audio be, with the default running parameters for the three_tier and two_tier models?

The longer, the better. In my opinion, around 1-2 seconds should be sufficient to capture the texture of the audio; however, it depends on many other things, like the kind of data the model was originally trained on, etc.