ibab / tensorflow-wavenet

A TensorFlow implementation of DeepMind's WaveNet paper

argmax or random.choice in generate?

HyperGD1994 opened this issue

In generate.py, the next sample is chosen using random.choice with scaled_prediction. I'm confused about why it doesn't use argmax to pick the highest prediction every time.

I have tried argmax, but it doesn't work: the output is silence the whole time. Does anyone have any idea? Thanks.

WaveNet predicts a probability distribution for the next sample of the waveform. In general, always picking the mode of a probability distribution will result in a sample which is very unrealistic.

To see why you are getting silence, suppose that you train on a dataset where the first 500 ms of each clip is silence and the second 500 ms is speech. If WaveNet sees silence in its input, it will predict that the next sample is very likely to be silence, but there is some small probability that it is not silence, because the speech has to start somewhere, after all. If you randomly sample from this probability distribution, you will get silence for a little while, and then at some point you will get something other than silence (which will hopefully sound like speech). But if you always pick the most likely value, you will always pick silence and you will never get speech.
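The effect can be reproduced with a toy example. The sketch below is not the code in generate.py. It assumes a made-up two-level "model" (`toy_prediction`, with invented probabilities) in place of the real network, and it uses NumPy's `Generator.choice` for the sampling step. It only illustrates why argmax gets stuck on silence while random sampling eventually leaves it.

```python
import numpy as np

# Hypothetical toy model, not the repository's network: only two
# quantization levels, 0 = "silence" and 1 = "non-silence".
SILENCE, SOUND = 0, 1


def toy_prediction(previous):
    """Return an assumed next-sample distribution given the previous sample."""
    if previous == SILENCE:
        return np.array([0.99, 0.01])  # silence is by far the mode
    return np.array([0.05, 0.95])      # once sound starts, it tends to continue


def generate(n_samples, pick, seed=0):
    """Generate samples one at a time, choosing each by argmax or by sampling."""
    rng = np.random.default_rng(seed)
    waveform = [SILENCE]  # start from silence, as in the scenario above
    for _ in range(n_samples):
        prediction = toy_prediction(waveform[-1])
        if pick == "argmax":
            sample = int(np.argmax(prediction))
        else:
            sample = int(rng.choice(len(prediction), p=prediction))
        waveform.append(sample)
    return np.array(waveform[1:])


greedy = generate(2000, "argmax")
sampled = generate(2000, "sample")
print("argmax, non-silent samples: ", int(greedy.sum()))
print("sampled, non-silent samples:", int(sampled.sum()))
```

With argmax, the 0.01 chance of leaving silence is never taken, so every generated sample is silence. With sampling, that small probability is eventually hit and the output moves into the non-silent state.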

How not to choose the most likely value

The TensorFlow translation example (encoder-decoder) brought me here: https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/nmt_with_attention.ipynb

I'm not sure how @joe-antognini's answer applies to translation, but I like the idea that "always picking the mode of a probability distribution will result in a sample which is very unrealistic", especially when we're talking about human-related themes such as language.