fatchord / WaveRNN

WaveRNN Vocoder + TTS

Home Page: https://fatchord.github.io/model_outputs/

Silence in batch-generated audio

742617000027 opened this issue · comments

Hey all!

We've encountered a problem where the WaveRNN sometimes predicts audio (with batched generation enabled) interspersed with relatively long chunks of near-silence (values close to 0). See the image below for an example.

Silence in audio

The WaveRNN has been trained for one million steps with the following hparams:

# Settings for all models
sample_rate = 8000
n_fft = 256
fft_bins = n_fft // 2 + 1
num_mels = 80
hop_length = 128
win_length = 256
fmin = None
min_level_db = -100
ref_level_db = 20
bits = 9
mu_law = True                       
peak_norm = False

# Model Hparams
voc_mode = 'RAW'
voc_upsample_factors = (4, 4, 8)
voc_rnn_dims = 512
voc_fc_dims = 512
voc_compute_dims = 128
voc_res_out_dims = 128
voc_res_blocks = 10

# Training
voc_batch_size = 50
voc_lr = 1e-4
voc_checkpoint_every = 10_000
voc_gen_at_checkpoint = 5
voc_total_steps = 1_000_000
voc_test_samples = 50
voc_pad = 2
voc_seq_len = hop_length * 20
voc_clip_grad_norm = 4

# Generating / Synthesizing
voc_gen_batched = True
voc_target = 11_000
voc_overlap = 550
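As a side note on these settings: with `mu_law = True` and `bits = 9`, the RAW-mode targets are 2**9 = 512 mu-law classes. A minimal NumPy sketch of standard mu-law companding (function names here are illustrative; the repo's exact rounding may differ slightly):

```python
import numpy as np

bits = 9
mu = 2 ** bits - 1  # 511 -> 512 output classes

def encode_mu_law(x, mu=mu):
    """Compand a float waveform in [-1, 1] to integer classes [0, mu]."""
    fx = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((fx + 1) / 2 * mu + 0.5).astype(np.int64)

def decode_mu_law(y, mu=mu):
    """Invert the companding back to a float waveform in [-1, 1]."""
    fx = 2 * y.astype(np.float64) / mu - 1
    return np.sign(fx) / mu * ((1 + mu) ** np.abs(fx) - 1)
```

The companding spends more of the 512 classes on small amplitudes, which is exactly the region where the silence in the plots above lives.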

We produce the Mel conditioning ourselves as part of our application; it is not the output of the Tacotron TTS model.
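Since the conditioning is produced outside the repo, one thing worth double-checking is that it matches the normalization the vocoder was trained on, i.e. the convention implied by `min_level_db` and `ref_level_db` above. A sketch of the common WaveRNN-style convention (an assumption about the pipeline, not necessarily the exact code used here):

```python
import numpy as np

min_level_db = -100
ref_level_db = 20

def amp_to_db(x):
    # floor at 1e-5 to avoid log(0)
    return 20 * np.log10(np.maximum(1e-5, x))

def normalize_mel(mel_amp):
    """Map a linear-amplitude mel spectrogram into [0, 1]."""
    db = amp_to_db(mel_amp) - ref_level_db
    return np.clip((db - min_level_db) / -min_level_db, 0, 1)
```

A mismatch here (e.g. feeding unnormalized dB values) tends to produce degenerate output rather than a clean failure.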

We have been wondering why the model would produce consecutive chunks of silence instead of, for example, random noise, as it does at the beginning of training. Opinions and ideas on the matter would be greatly appreciated!

How does the WaveRNN perform on ground-truth mels? Same problem?

Hey, thanks for the quick reply! We haven't actually tried inference with ground truth Mels yet, will do and report back!

One thing to note is that the model does not consistently fail to predict audio, even when given the same conditioning. On the example from the OP, the model might do fine half of the time and produce outputs with interspersed silence the rest of the time. Some variation is of course expected given the sampling procedure used to pick the actual audio sample values, so might the silence be some form of mode collapse? Then again, it seems strange that the silence occurs so contiguously, since during batched generation the batched parts of the signal are generated independently of each other, correct?
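Roughly, yes: in batched generation the signal is cut into overlapping folds of length `voc_target` with `voc_overlap` samples of shared context, the folds are generated in parallel, and the overlaps are crossfaded back together. A toy NumPy sketch of the unfold step (using a linear crossfade for simplicity; the repo's actual fade curve may differ):

```python
import numpy as np

def xfade_and_unfold(folds, overlap):
    """Stitch (num_folds, target + overlap) rows into one signal,
    crossfading each fold's tail into the next fold's head."""
    num_folds, length = folds.shape
    target = length - overlap
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in

    out = np.zeros(num_folds * target + overlap)
    for i, fold in enumerate(folds):
        fold = fold.copy()
        if i > 0:
            fold[:overlap] *= fade_in    # ramp this fold's head in
        if i < num_folds - 1:
            fold[-overlap:] *= fade_out  # ramp this fold's tail out
        out[i * target : i * target + length] += fold
    return out
```

Because the fades sum to one across each overlap, a contiguous stretch of silence much longer than `voc_overlap` would have to span multiple independently generated folds, which does make the observed behavior puzzling.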

Is this fixed with 544cd5d ?

Unfortunately not. We're currently investigating whether there might be an issue with torch.distributions.Categorical's .sample() method, since there has been at least one reported case of broken random sampling in PyTorch, albeit for a different distribution class (see pytorch/pytorch#22529). For now, as a test, we're using NumPy instead of torch to do the sampling. We'll report back if we gain any meaningful insights.
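For anyone who wants to try the same experiment, here is a minimal sketch of categorical sampling with NumPy in place of `torch.distributions.Categorical` (an illustration, not the repo's code; in the real generation loop the logits come from the model at every timestep):

```python
import numpy as np

def sample_categorical(logits, rng=None):
    """Draw one class index per row of a (batch, classes) logit array."""
    if rng is None:
        rng = np.random.default_rng()
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(p.shape[-1], p=p) for p in probs])
```

If the silence persists with this replacement, that would point away from the torch sampler and back toward the model or the conditioning.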