soroushmehr / sampleRNN_ICLR2017

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Home Page: https://arxiv.org/abs/1612.07837

Pretty amazing results from a training run on classical guitar music (single instrument), at epoch 5 & 7

LinkOne1A opened this issue

https://soundcloud.com/user-637335781/sets/training-1-on-classical-guitar-music-single-instrument-at-epoch-5-7

Subjectively speaking, the sound quality is better than the results I got from training on the piano set.

WAV files generated at epoch 5 (~15k training iterations) and epoch 7 (~20k training iterations)

Single GPU: 8 GB GTX 1080 with 2560 CUDA cores
End of epoch 7 reached at about 8 hours

Validation! Done!

>>> Best validation cost of 1.78753066063 reached. Testing! Done!
>>> test cost:1.8329474926	total time:60.4850599766
epoch:7	total iters:20498	wall clock time:7.04h
>>> Lowest valid cost:1.78753066063	 Corresponding test cost:1.8329474926
	train cost:1.7714	total time:6.00h	per iter:1.054s
	valid cost:1.7875	total time:0.02h
	test  cost:1.8329	total time:0.02h
Saving params! Done!

Run command:
THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier.py --exp BEST_2TIER --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 3 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 64 --weight_norm True --learn_h0 True --which_set MUSIC

Training was done on several guitar passages from YouTube; total audio duration ~4 hours.
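For anyone wanting to prepare a similar dataset, here is a rough sketch of how long recordings could be sliced into fixed-length 16 kHz mono segments before training. This is not the repo's preprocess.py; the segment length, sample rate, and paths are placeholder assumptions.

```python
# Hypothetical preprocessing sketch (not the repo's preprocess.py): slice long
# recordings into fixed-length 16 kHz mono segments.
import numpy as np
from scipy.io import wavfile

def slice_wav(in_path, out_dir, seg_seconds=8, target_sr=16000):
    sr, audio = wavfile.read(in_path)
    if audio.ndim > 1:                       # down-mix stereo to mono
        audio = audio.mean(axis=1)
    assert sr == target_sr, "resample offline first (e.g. with sox/ffmpeg)"
    seg_len = seg_seconds * target_sr
    n_segs = len(audio) // seg_len
    for i in range(n_segs):
        seg = audio[i * seg_len:(i + 1) * seg_len].astype(np.int16)
        wavfile.write("%s/seg_%05d.wav" % (out_dir, i), target_sr, seg)
    return n_segs
```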

Thanks for sharing. Good to know! :)
People have tried it on different datasets, including Korean, classical music, and ambient. @richardassar even got interesting results from training on a couple of hours of Tangerine Dream works. See: https://soundcloud.com/psylent-v/tracks

Do you know if the generated sound (for Tangerine Dream) is purely from the network, or was it mixed with some supporting passages (such as drums) added by a human?

@LinkOne1A These guitar samples are really nice

Credit to the network and the folks behind the SampleRNN paper!

I'm surprised that training on multi-instrument data worked so well, and I'm puzzled as to why that is. My intuition was that multiple instruments playing at the same time limit the space of valid ("pleasing" may be a better word) output combinations; I'm not sure how I would go about proving (or disproving) this.

What has been your experience in this area?

For Tangerine Dream the validation loss (in bits) was close to 3.2 after 300k iterations, vs below 1.0 for solo piano. Note that I used the three_tier.py model in both cases.

It seems weight normalisation, both in the linear layers and in the transformations inside the GRUs, helps with generalisation. I'm conducting some experiments to verify this for myself.
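For reference, weight normalisation here means the reparameterisation from Salimans & Kingma (2016), where each weight vector is split into a direction and a learned scale. A minimal numpy sketch of the idea (illustrative names and shapes, not the repo's Theano implementation):

```python
# Weight normalisation: w = g * v / ||v||, so direction (v) and scale (g)
# are learned separately.
import numpy as np

def weight_norm(v, g):
    # v: (in_dim, out_dim) direction parameters; g: (out_dim,) per-unit gains
    norms = np.sqrt((v ** 2).sum(axis=0, keepdims=True))   # column norms of v
    return g * v / norms                                    # effective weight matrix

v = 0.05 * np.random.randn(1024, 1024)
g = np.ones(1024)
W = weight_norm(v, g)   # used in place of a plain weight matrix in a linear layer or GRU gate
```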

Some of SampleRNN's capacity to generalise might be due to the quantization noise introduced in going to 8 bits. It may be interesting to try something like https://arxiv.org/pdf/1701.06548.pdf to further improve generalisation; however, I've yet to observe an increase in validation loss, so we're probably slightly underparameterised on these datasets.
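For context, the linked paper regularises by penalising confident output distributions, i.e. subtracting a scaled entropy term from the negative log-likelihood. A hedged numpy sketch of that loss over the 256 quantization levels, with beta as an assumed hyperparameter (not code from the repo):

```python
import numpy as np

def nll_with_confidence_penalty(probs, targets, beta=0.1, eps=1e-8):
    # probs: (batch, 256) softmax over quantization levels; targets: (batch,) integer levels
    nll = -np.log(probs[np.arange(len(targets)), targets] + eps).mean()
    entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean()
    return nll - beta * entropy   # confident (low-entropy) outputs are penalised
```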

Ignoring independent_preds, the three_tier model has around 18 million parameters, which is evidently sufficient to capture the temporal dynamics of the signal to an acceptable degree. If you think about it from an information-theoretic point of view (Kolmogorov complexity / minimum description length), there's a lot of redundancy in the signal that can be "compressed" away by the network.

The model seems to capture various synthesiser sounds, crowd noise from the live recordings, and both synthetic and real drums, including correct percussive patterns; however, it could not maintain a consistent tempo. This could be helped by conditioning on an auxiliary time series.

If you used preprocess.py in datasets/music/ then you may want to run
your experiments again. See:
#14

Thanks for the details! Interesting and surprising that a validation loss of 3.2 produces the Tangerine Dream segment.

What was the length of the original (total) audio?

How long did it take to get to 300K steps, and what kinda GPU do you have?

The quantization noise (going to 8 bits), are you referring to the mu-law encode/decode? I ran a standalone test of a mu-law-processed WAV file vs the original WAV: I could not hear the difference, and inverting one and summing the two sources in Audacity showed very little amplitude, mostly, I think, at the high end of the spectrum.
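For anyone who wants to repeat that null test in code rather than in Audacity, a small sketch: mu-law compress a signal to 8 bits, expand it back, and inspect the residual. mu = 255 as in G.711; the file name and the assumption of 16-bit PCM input are placeholders.

```python
import numpy as np
from scipy.io import wavfile

def mu_law_encode(x, mu=255.0):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255.0):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

sr, audio = wavfile.read("guitar.wav")                # placeholder file
x = audio.astype(np.float64) / 32768.0                # assume 16-bit PCM, scaled to [-1, 1)
q = np.round((mu_law_encode(x) + 1.0) / 2.0 * 255.0)  # 8-bit companded levels
x_hat = mu_law_decode(q / 255.0 * 2.0 - 1.0)          # expand back to [-1, 1]
residual = x - x_hat                                  # what the inverted/summed mix reveals
print("max |residual|:", np.abs(residual).max())
```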

I haven't thought about the descriptive complexity of the source signal. Now that you mention it, I'd say that if a network had to deal with already-compressed data and had to figure out a predictive solution for the next likely outcome in a series... I don't know, but my hunch is that it would be more difficult for the network, which means we would need more complexity (layers) in the network. This is my off-the-cuff thought!

I'll check it out ( #14 ).

I have not yet looked into the 3 tier training.

Total audio was about 32 hours, although due to the bug I didn't end up training on all of it!

It seems the loss required for acceptable samples is really relative to the dataset: multiple instruments increase the entropy of the signal, and unlike with piano the model seems to get "lost" far less frequently because it has a more varied space in which to recover. Before fully converging, the piano samples sometimes go unstable; this effect was almost non-existent when training on Tangerine Dream.

No, I'm referring to quantization noise as x - Q(x) for any quantization scheme. Introducing noise of any kind acts as a regulariser, e.g. https://en.wikipedia.org/wiki/Tikhonov_regularization
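In other words, for a linear 8-bit quantizer like the one selected in the run command above (--q_type linear --q_levels 256) the noise looks roughly like this; a sketch only, the repo's exact quantizer may differ in detail:

```python
import numpy as np

def linear_quantize(x, q_levels=256):
    # x assumed in [-1, 1]; map to integer levels 0..q_levels-1 and back
    levels = np.floor((x + 1.0) / 2.0 * (q_levels - 1) + 0.5)
    return levels / (q_levels - 1) * 2.0 - 1.0

x = np.random.uniform(-1, 1, 16000)       # one second of fake audio
noise = x - linear_quantize(x)            # x - Q(x): roughly uniform, width ~2/255
print("noise std:", noise.std())
```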

It's true that compressed data has more entropy, but if you decompress the signal again (assuming lossy compression) the resulting entropy is lower than that of the original signal, and it should be easier to model. I was referring to the compressibility of the signal; it seems there's plenty of scope for that.

Something that would be interesting to try, akin to the speaker conditioning mentioned in the Char2Wav paper, is conditioning on instrument or genre with an embedded one-hot signal. This might allow interpolation between styles, etc. This is an area of research I'll be looking into over the next while.
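A rough sketch of what that conditioning could look like: embed a one-hot instrument/genre id and concatenate it with the frame-level RNN input at every time step. Shapes and names below are illustrative assumptions, not from the repo.

```python
import numpy as np

n_classes, emb_size, frame_feat = 8, 64, 1024
emb_table = 0.01 * np.random.randn(n_classes, emb_size)   # learned jointly in practice

def condition_frames(frame_inputs, class_id):
    # frame_inputs: (n_frames, frame_feat); broadcast the class embedding over time
    cond = np.tile(emb_table[class_id], (frame_inputs.shape[0], 1))
    return np.concatenate([frame_inputs, cond], axis=1)    # (n_frames, frame_feat + emb_size)

x = np.random.randn(64, frame_feat)
x_cond = condition_frames(x, class_id=3)
```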

It took a couple of days to get to 300k steps; I'm training on a GTX 1080. The machine I have has two, but the script does not split minibatches over both GPUs. I have implemented SampleRNN myself and it can train 4x faster (without GRU weight norm); it will be released soon.

Although I have avoided it so far, it's probably worth filtering out low-amplitude audio segments from the training set. These get amplified during normalization, which pulls up the noise floor and introduces lots of high-energy noise that can only disrupt or slow down training.
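Something like the following would do that pruning: compute the RMS power of each training segment and drop anything below a threshold before normalization. The glob path and the threshold are guesses to be tuned against the plot below.

```python
import glob
import numpy as np
from scipy.io import wavfile

def rms(path):
    sr, audio = wavfile.read(path)
    x = audio.astype(np.float64) / 32768.0   # assume 16-bit PCM
    return np.sqrt((x ** 2).mean())

segments = glob.glob("datasets/music/*.wav")            # placeholder path
powers = {p: rms(p) for p in segments}
keep = [p for p, r in powers.items() if r > 1e-3]       # prune near-silent segments
```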

A plot of the RMS power over each segment in the piano database shows the distribution; you can see the tail of low-energy signals on the right, which could probably be pruned (especially the one segment with zero energy).

[Figure: rms_powers — distribution of RMS power over segments in the piano dataset]