r9y9 / wavenet_vocoder

WaveNet vocoder

Home Page: https://r9y9.github.io/wavenet_vocoder/


Planned TODOs

r9y9 opened this issue · comments

This is an umbrella issue to track progress for my planned TODOs. Comments and requests are welcome.

Goal

  • achieve higher speech quality than conventional vocoders (WORLD, Griffin-Lim, etc.)
  • provide a pre-trained model of a WaveNet-based mel-spectrogram vocoder

Model

  • 1D dilated convolution
  • batch forward
  • incremental inference
  • local conditioning
  • global conditioning
  • upsampling network (by transposed convolutions)

Training script

  • Local conditioning
  • Global conditioning
  • Configurable maximum number of time steps (to avoid out of memory error). 58ad07f

Experiments

  • unconditioned WaveNet trained with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with CMU Arctic
  • conditioning model on mel-spectrogram and speaker id with CMU Arctic
  • conditioning model on mel-spectrogram (local conditioning) with LJSpeech
  • DeepVoice3 + WaveNet vocoder r9y9/deepvoice3_pytorch#21

Misc

  • [ ] Time sliced data generator?
  • Travis CI
  • Train/val split
  • README

Sampling frequency

  • 4kHz
  • 16kHz
  • 22.05kHz
  • 44.1kHz
  • 48kHz

Advanced (lower priority)

At the moment, I think I have finished implementing the basic features (batch/incremental inference, local/global conditioning) and confirmed that an unconditioned WaveNet trained on CMU ARCTIC (~1200 utterances, 16kHz) can generate sounds that resemble speech. Audio samples are attached.

step80000.zip

top: real speech, bottom: generated speech. Only the first sample of the real speech was fed to the WaveNet decoder as the initial input.

step000080000_waveplots

step90000.zip

step000090000_waveplots

For reference, these are other WaveNet projects I know of:
https://github.com/ibab/tensorflow-wavenet
https://github.com/tomlepaine/fast-wavenet - a faster implementation of the original WaveNet paper.

Still not quite high quality, but the vocoder conditioned on mel-spectrograms has started to work. Audio samples from a model trained for 10 hours are attached.

step90000.zip
step000090000_waveplots

step95000.zip
step000095000_waveplots

Finished transposed convolution support at 8c0b5a9. Started training again.

Hi, I've already tried to use linguistic features as local features, but I found there might be a problem: linguistic features are phoneme-level, mel-specs are frame-level, but the local features fed to the WaveNet inputs are sample-level.

Here is a case: if a phoneme's duration is 0.25s and the sample rate is 16k, then in order to create the WaveNet inputs, I have to duplicate that single phoneme's linguistic feature int(0.25 * 16000) = 4000 times to serve as those samples' local features. Do you think my practice is right or not? How do you process the mel-spec features given that they are frame-level?
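
For concreteness, a minimal numpy sketch of the duplication described above (the durations, feature dimension, and values are toy examples, not from any particular front-end):

import numpy as np

sample_rate = 16000
# toy (duration in seconds, linguistic feature vector) pairs, one per phoneme
phonemes = [(0.25, np.zeros(10)), (0.10, np.ones(10))]

per_sample_features = []
for duration, feat in phonemes:
    n_samples = int(duration * sample_rate)              # 0.25 s -> 4000 copies
    per_sample_features.append(np.repeat(feat[None, :], n_samples, axis=0))

local_features = np.concatenate(per_sample_features, axis=0)
print(local_features.shape)                              # (5600, 10)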

Thanks for answering me.

Can WaveNet capture the differences even if many samples' local features are the same, as long as its receptive field is wide enough?

@jamestang0219 I think you are right. In the paper http://www.isca-speech.org/archive/Interspeech_2017/pdfs/0314.PDF, they use log-F0 and mel-cepstrum as conditional features and duplicate them to adjust the time resolution. I also tried this idea and got reasonable results.

Latest audio sample attached. Mel-spectrograms are repeated to adjust the time resolution. See

wavenet_vocoder/audio.py

Lines 39 to 40 in b8ee2ce

upsample_factor = quantized.size // mel.shape[0]
mel = np.repeat(mel, upsample_factor, axis=0)
In this case upsample_factor was always 256.

step70000.zip
step000070000_waveplots

@r9y9 In your source code, do you use transposed convolution to implement the upsampling? Have you checked which method is better for upsampling?

@jamestang0219 I implemented transposed convolution but haven't had success yet. I suspect 256x upsampling is hard to train, especially for the small dataset I'm experimenting with now. The WaveNet authors reported transposed convolution is better, though.

# If True, use transposed convolutions to upsample conditional features,
# otherwise repeat features to adjust time resolution
upsample_conditional_features=False,
# should np.prod(upsample_scales) == hop_size
upsample_scales=[16, 16],

For now I am not using transposed convolution.

@r9y9 May I ask your hyperparameters for extracting the mel spectrogram? Is the frame shift 0.0125s and the frame width 0.05s? If those are your parameters, why do you use 256 as the upsample factor instead of sr (16000) * frame_shift (0.0125) = 200? Any tricks here? Forgive me for the many questions :( I also want to reproduce the Tacotron 2 result.

@jamestang0219 Hyper parameters for audio parameter extraction:

# Audio:
sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

I use frame shift 256 samples / 16 ms.

@r9y9 I notice that in Tacotron 2, two upsampling layers with transposed convolution are used. But in my WaveNet implementation, it still doesn't work.

@npuichigo Could you share what parameters (padding, kernel_size, etc.) you are using? I tried 1D transposed convolution with stride=16, kernel_size=16, padding=0 applied twice to upsample the inputs by 256x.

if upsample_conditional_features:
    self.upsample_conv = nn.ModuleList()
    for s in upsample_scales:
        self.upsample_conv.append(ConvTranspose1d(
            cin_channels, cin_channels, kernel_size=s, padding=0,
            dilation=1, stride=s, std_mul=1.0))
        # Is this non-linearity necessary?
        self.upsample_conv.append(nn.ReLU(inplace=True))

@r9y9 My parameters are listed below. Because I use a frame shift of 12.5 ms, the upsampling factor is 200.

# Audio
num_mels=80,
num_freq=1025,
sample_rate=16000,
frame_length_ms=50,
frame_shift_ms=12.5,
min_level_db=-100,
ref_level_db=20

# Transposed convolution 10*20=200 (tensorflow)
up_lc_batch = tf.expand_dims(lc_batch, 1)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 10),
       strides=(1, 10), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, self.out_channels, (1, 20),
       strides=(1, 20), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / self.out_channels))
up_lc_batch = tf.squeeze(up_lc_batch, 1)

https://r9y9.github.io/wavenet_vocoder/

Created a simple project page and uploaded audio samples for speaker-dependent WaveNet vocoder. I'm working on global conditioning (speaker embedding) now.

@r9y9 Regarding the upsampling network, I found that 2D transposed convolution works well, while the 1D version generates speech with unnatural prosody, maybe because 2D transposed convolution only considers local information in the frequency domain.

height_width = 3  # kernel width along frequency axis
up_lc_batch = tf.expand_dims(lc_batch, 3)
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (10, height_width),
       strides=(10, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.layers.conv2d_transpose(
       up_lc_batch, 1, (20, height_width),
       strides=(20, 1), padding='SAME',
       kernel_initializer=tf.constant_initializer(1.0 / height_width))
up_lc_batch = tf.squeeze(up_lc_batch, 3)

@npuichigo Thank you for sharing that! Did you check the output of the upsampling network? Could the upsampling network actually learn upsampling? I mean, did you get a high-resolution mel-spectrogram? I was wondering if I need to add a loss term for the upsampling (e.g., MSE between the coarse mel-spectrogram and a 1-shift high-resolution mel-spectrogram), and I'm curious whether it can be learned without an upsampling-specific loss.

@r9y9 I think transposed convolution with the same stride and kernel size is similar to duplication. As in the following picture, if the kernel is one everywhere, then it's just duplication. So maybe I need to check the values of the kernel after training.
padding_no_strides_transposed_test_28
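
A quick sanity check of this point (my own sketch, not code from the repo): a ConvTranspose1d whose stride equals its kernel size and whose weights are all ones reproduces np.repeat exactly.

import torch
from torch import nn

upsample_factor = 4
conv = nn.ConvTranspose1d(1, 1, kernel_size=upsample_factor,
                          stride=upsample_factor, bias=False)
conv.weight.data.fill_(1.0)

x = torch.arange(5, dtype=torch.float32).view(1, 1, -1)  # (batch, channel, time)
with torch.no_grad():
    y = conv(x)
# y is x with every element repeated 4 times:
# tensor([[[0., 0., 0., 0., 1., 1., 1., 1., 2., 2., 2., 2., ...]]])
print(y)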

https://r9y9.github.io/wavenet_vocoder/

Added audio samples for multi-speaker version of WaveNet vocoder.

Hello @r9y9, great work and awesome samples. Would you mind sharing the weights of the network for the WaveNet vocoder trained on mel-spectrograms with the CMU ARCTIC dataset without speaker embedding? I would like to use them and compare with Griffin-Lim reconstruction to see which works better.

@rishabh135 Not at all. Here it is: https://www.dropbox.com/sh/b1p32sxywo6xdnb/AAB2TU2DGhPDJgUzNc38Cz75a?dl=0

Note that you have to use exactly the same mel-spectrogram extraction

wavenet_vocoder/audio.py

Lines 66 to 69 in f05e520

def melspectrogram(y):
    D = _lws_processor().stft(y).T
    S = _amp_to_db(_linear_to_mel(np.abs(D)))
    return _normalize(S)

and the same hyperparameters:
sample_rate=16000,
silence_threshold=2,
num_mels=80,
fft_size=1024,
# shift can be specified by either hop_size or frame_shift_ms
hop_size=256,
frame_shift_ms=None,
min_level_db=-100,
ref_level_db=20,

Using the transposed convolution below, I can get a good initialization for the upsampling network. Very nice, thanks @npuichigo!

from torch import nn

kernel_size = 3
padding = (kernel_size - 1) // 2
upsample_factor = 16

conv = nn.ConvTranspose2d(1, 1, kernel_size=(kernel_size, upsample_factor),
                          stride=(1, upsample_factor), padding=(padding, 0))
conv.bias.data.zero_()
conv.weight.data.fill_(1 / kernel_size)
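
For reference, a usage sketch of applying this conv; the (batch, 1, n_mels, frames) layout below is my assumption about how the mel is fed in:

import torch

mel = torch.randn(1, 1, 80, 100)   # 80 mel bins, 100 frames
with torch.no_grad():
    upsampled = conv(mel)
print(upsampled.shape)             # torch.Size([1, 1, 80, 1600]), i.e. 16x in time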

Mel-spectrogram (hop_size = 256)


16x upsampled mel-spectrogram


I have added a brief README.

I tried using mel-specs as the input conditioning feature (by duplicating the mel-specs of each frame) to train the WaveNet model on the LJSpeech dataset, but I cannot get a reasonable result after 90k steps even though the loss keeps decreasing.

I found that in your training procedure there is no padding before the first time step, but in the generation procedure there IS some padding before the initial value to satisfy the conv layers' receptive field.

Does it receive the right start information in the generation procedure?

Here are some logs and waveplots:

Receptive field (samples / ms): 1021 / 46.3038548753

Epoch: 1, Avg_loss: 2.78487607414

Epoch: 2, Avg_loss: 2.33630954099

Epoch: 3, Avg_loss: 2.27544190529

Epoch: 4, Avg_loss: 2.21869832258

Epoch: 5, Avg_loss: 2.10031874228

Epoch: 6, Avg_loss: 1.78190998421

Epoch: 7, Avg_loss: 1.2610225059

(waveplots attached)

I believe the padding for the first time step is handled by nn.Conv1d for both batch and incremental forward computation. It works at least for CMU ARCTIC.

if padding is None:
    # no future time stamps available
    if causal:
        padding = (kernel_size - 1) * dilation
    else:
        padding = (kernel_size - 1) // 2 * dilation

One possible reason I can think of for why you cannot get a good result is that the speech samples in LJSpeech have reverberation. This might make it hard to learn long-term dependencies. Maybe you need more channels, layers, etc. I will also try LJSpeech soon.

By the way, I haven't gotten loss values < 1.9, but you seem to get loss values < 1.5. How did you get that?

@r9y9 I don't know whether pre-emphasis and fft_size influence the results. I use pre_emphasis=0.97 and fft_size=2048.

Also, I didn't downsample the LJSpeech waveforms; they are 22050 samples per second.

I'll continue trying on the LJSpeech dataset by changing model hyperparameters and architectures, and will post good results here.

The average loss is computed by the code below:

while global_epoch < nepochs:
    running_loss = 0.0
    for step, batch in enumerate(data_loader):
        '''
        training procedure here.
        '''
        loss = criterion(y_hat[:, :, :-1, :], y[:, 1:, :], mask=mask)
        print('Step: ' + str(global_step) + ', Loss: ' + str(loss.data[0]))
        running_loss += loss.data[0]
    averaged_loss = running_loss / (len(data_loader))
    global_epoch += 1
    print('Epoch: ' + str(global_epoch) + ', Avg_loss: ' + str(averaged_loss))

And the trends:

Receptive field (samples / ms): 1021 / 46.3038548753
Step: 1, Loss: 5.54383468628
Step: 2, Loss: 5.53908395767
Step: 3, Loss: 5.54337501526
...
Epoch: 1, Avg_loss: 2.78487607414
Step: 13078, Loss: 2.57374691963
Step: 13079, Loss: 2.64829707146
Step: 13080, Loss: 2.41457295418
...
Epoch: 2, Avg_loss: 2.33630954099
Step: 26155, Loss: 2.42479777336
Step: 26156, Loss: 2.46555685997
Step: 26157, Loss: 2.46027398109
...
Epoch: 3, Avg_loss: 2.27544190529
Step: 39232, Loss: 2.3482234478
Step: 39233, Loss: 2.19915151596
Step: 39234, Loss: 2.19278645515
...
Epoch: 4, Avg_loss: 2.21869832258
Step: 52309, Loss: 2.21737265587
Step: 52310, Loss: 2.24765992165
Step: 52311, Loss: 2.32877469063
...
Epoch: 5, Avg_loss: 2.10031874228
Step: 65386, Loss: 1.66409289837
Step: 65387, Loss: 1.55183362961
Step: 65388, Loss: 1.5333313942
...
Epoch: 6, Avg_loss: 1.78190998421
Step: 78463, Loss: 1.16430306435
Step: 78464, Loss: 0.585283100605
Step: 78465, Loss: 1.08736658096
...
Epoch: 7, Avg_loss: 1.2610225059
Step: 91540, Loss: 0.178501829505
Step: 91541, Loss: 0.168980106711
Step: 91542, Loss: 0.86912381649
...
Epoch: 8, Avg_loss: 0.81647025647
Step: 104617, Loss: 0.0704396516085
Step: 104618, Loss: 0.401492923498
Step: 104619, Loss: 0.103701047599

Thank you for the information! I will share mine when I get good results. I'm trying the following hyperparameters:

diff --git a/hparams.py b/hparams.py
index 3a0be85..0189c30 100644
--- a/hparams.py
+++ b/hparams.py
@@ -17,7 +17,7 @@ hparams = tf.contrib.training.HParams(
     },
 
     # Audio:
-    sample_rate=16000,
+    sample_rate=22050,
     silence_threshold=2,
     num_mels=80,
     fft_size=1024,
@@ -28,9 +28,9 @@ hparams = tf.contrib.training.HParams(
     ref_level_db=20,
 
     # Model:
-    layers=16,
+    layers=20,
     stacks=2,
-    residual_channels=256,
+    residual_channels=512,
     gate_channels=512,  # split into 2 gropus internally for gated activation
     skip_out_channels=256,
     dropout=1 - 0.95,
@@ -67,7 +67,7 @@ hparams = tf.contrib.training.HParams(
     # Loss
 
     # Training:
-    batch_size=1,
+    batch_size=2,
     adam_beta1=0.9,
     adam_beta2=0.999,
     adam_eps=1e-8,
@@ -81,7 +81,7 @@ hparams = tf.contrib.training.HParams(
     # This is needed for those who don't have huge GPU memory...
     # if both are None, then full audio samples are used
     max_time_sec=None,
-    max_time_steps=20000,
+    max_time_steps=8000,

In the training procedure, if the length of the waveform is more than max_time_steps, it is cut randomly by the following code:

if max_time_steps is not None and len(x) > max_time_steps:
    s = np.random.randint(0, len(x) - max_time_steps)
    x, c = x[s:s + max_time_steps], c[s:s + max_time_steps, :]

So the x value at the first time step is not always mulaw_quantize(0).
But in the generation procedure, the initial value is always mulaw_quantize(0):

initial_value = mulaw_quantize(0)
print("Intial value:", initial_value)
initial_input = to_categorical(initial_value, num_classes=256).astype(np.float32)

I think that's why the training loss is low but the generation results are not good.

I was hoping that edge case doesn't matter. Assuming the size of the receptive field is 1021, we actually have a zero-padded input of length max_time_steps + 1020: 0, 0, ..., 0, x[0], x[1], ..., x[max_time_steps-1].
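
As an aside, here is a small sketch of where the 1021-sample receptive field quoted in this thread comes from, assuming kernel_size=3, layers=16, stacks=2 (standard dilated-convolution arithmetic; the parameter values are taken from the hyperparameters discussed above):

def receptive_field_size(layers=16, stacks=2, kernel_size=3):
    layers_per_stack = layers // stacks
    dilations = [2 ** i for i in range(layers_per_stack)] * stacks  # 1, 2, ..., 128, twice
    return (kernel_size - 1) * sum(dilations) + 1

print(receptive_field_size())                  # 1021
print(receptive_field_size() / 22050 * 1000)   # ~46.3 ms at 22.05 kHz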

From my limited experience, though, the initial value is not very important when we condition the model on external features.

Thanks, I'll start another experiment using your code:)

step37566.zip

At step 37566 I get:

step000037566_waveplots

It seems to be working reasonably.

@r9y9 Congratulations! But my previous experiment cannot get good results yet. Did you use duplication or transposed convolution to upsample the mel specs?

I reviewed my code; my preprocessing module is different from yours. Could that cause the bad result? I didn't use lws.
Here is my code:

import librosa
import numpy as np

# mulaw_quantize, start_and_end_indices, pre_emphasis, _amp_to_db and _normalize
# are helpers from my own preprocessing code.
def load_wav_info(sound_file, params):
    pre_emphasis_coeff = params['pre_emphasis']
    wav, sr = librosa.load(sound_file, sr=params['sample_rate'])
    hop_length = int(params['frame_shift'] * sr)

    quantized = mulaw_quantize(wav)
    start, end = start_and_end_indices(quantized, params['silence_threshold'])
    quantized = quantized[start:end]
    wav = wav[start:end]

    y = pre_emphasis(wav, pre_emphasis_coeff)
    D = librosa.stft(y=y, n_fft=params['n_fft'],
                     hop_length=int(sr * params['frame_shift']),
                     win_length=int(sr * params['frame_length']))
    magnitude = np.abs(D)

    filters = librosa.filters.mel(sr, params['n_fft'], n_mels=params['frame_dim'])
    mel = np.dot(filters, magnitude)
    mel = _amp_to_db(mel)
    mel = _normalize(mel, params['min_level_db'])
    mel = np.transpose(mel.astype(np.float32))

    N = mel.shape[0]
    quantized = quantized[:N * hop_length]
    return quantized, mel

@jamestang0219 I'm using transposed convolutions for upsampling.

Regarding your code, did you implement pad_lr for librosa? My implementation is carefully designed for lws, so you may need to adjust it for librosa.

wavenet_vocoder/audio.py

Lines 95 to 102 in 39961f5

def lws_pad_lr(x, fsize, fshift):
    """Compute left and right padding lws internally uses
    """
    M = lws_num_frames(len(x), fsize, fshift)
    pad = (fsize - fshift)
    T = len(x) + 2 * pad
    r = (M - 1) * fshift + fsize - T
    return pad, pad + r

eval_step80000.zip

A full-length (~7 sec) eval output is attached. Still not very good, but it works.

@r9y9 I already removed the padding for librosa. By the way, is there any difference between librosa and lws for extracting mel specs? I found no frame_width in the lws processor.

https://librosa.github.io/librosa/generated/librosa.core.stft.html has a center parameter. If center=True, I believe the input signal is zero-padded. If you don't handle the padding carefully, you may end up with misaligned audio and mel-spectrograms.
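
To illustrate (my own sketch, reusing the hop/FFT sizes quoted above): the center padding changes the number of frames, which is exactly what can throw off audio/mel alignment when the WaveNet target assumes frames * hop_size samples.

import numpy as np
import librosa

sr, hop, n_fft = 16000, 256, 1024
y = np.random.randn(sr).astype(np.float32)   # 1 second of noise

n_centered = librosa.stft(y, n_fft=n_fft, hop_length=hop, center=True).shape[1]
n_plain = librosa.stft(y, n_fft=n_fft, hop_length=hop, center=False).shape[1]
print(n_centered, n_plain)                   # 63 vs 59 frames for this example
# A WaveNet conditioned with hop_size=256 expects exactly n_frames * 256 samples,
# so the padding has to be accounted for (lws_pad_lr above does this for lws).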

As far as I know, because lws is designed for phase reconstruction, it uses careful window normalization for STFT, while librosa doesn't. However, I don't think it matters for a WaveNet vocoder.

@r9y9 Hello, using transposed convolutions for upsampling local conditions gives reasonable results, but duplication does not, at least when using librosa.

Some experiment results, all based on LJSpeech:
(1) transposedConv2d, 2 stacks, 16 layers, Receptive field (samples / ms): 1021 / 46.3038548753, after 40k steps
Step: 45771, Loss: 2.19952297211
Step: 45772, Loss: 1.83744764328
Step: 45773, Loss: 2.57274341583
Epoch: 7, Avg_loss: 0.869552073245
image

(2) transposedConv2d, 2 stacks, 20 layers, Receptive field (samples / ms): 4093 / 185.623582766, after 60k steps
Step: 65388, Loss: 0.111330501735
Step: 65389, Loss: 0.907182753086
Step: 65390, Loss: 0.0600699409842
Epoch: 10, Avg_loss: 0.263670276677
image

(3) duplication, 2 stacks, 16 layers, Receptive field (samples / ms): 1021 / 46.3038548753, after 140k steps
Step: 143845, Loss: 0.0284290295094
Step: 143846, Loss: 0.0222870074213
Step: 143847, Loss: 2.19834375381
Epoch: 11, Avg_loss: 0.172115104762
image

@jamestang0219 Nice! It seems transposed convolution works better than duplication, as reported in the WaveNet paper.

@r9y9 I've already tried several combinations of the number of layers and stacks:
12 layers, 2 stacks
20 layers, 2 stacks
24 layers, 4 stacks (best MOS result in the Tacotron 2 paper)
but I cannot yet get results as good as Google's Tacotron 2 demo samples or DeepMind's original WaveNet demo samples after 150k+ steps.

The original WaveNet model uses a 256-way classification, but Tacotron 2 uses a 10-component mixture of logistic distributions. We implement WaveNet with the original method; do you think the model has converged after 150k steps or not?

The new Parallel WaveNet paper reported that they trained the teacher WaveNet for 1,000k steps with batch size 32. We may need to be more patient. In their paper, there's no mention of dropout or weight normalization, which we are currently using. There are many design choices I want to try to see how they work.

As for the mixture of logistic distributions, I'm currently working on it. See #5.

@r9y9 Great! Can't wait to test the mixture of logistic distributions loss!

@r9y9 great !

@jamestang0219 I have tried linguistic features with the WaveNet vocoder, the same as Deep Voice 1. We got an acceptable result with 20 layers and 64-bit. I think the model converges at about 300k iterations. My learning rate is 1e-3 and decays every 1000 iterations with factor 0.998.

@mfkfge Could you please tell me how you extract the linguistic features? Thank you!

@r9y9 Nice, much better than the 256-way classification.

@jamestang0219 Hi, can I ask which model is the best in your experiments #1 (comment)? I'm currently trying 24 layers / 4 stacks at #5 and want to know which is the best in your case.

@r9y9 After training 180k steps, they are both just audible but noisy; neither is clearly better than the other. I think they haven't converged yet, and I'll train them for more steps. If you want, I can upload samples from them. Inference may take a lot of time.

@jamestang0219 Thanks! I will continue my experiment with 24 layers / 4 stacks.

@r9y9 I've started training a Tacotron model that predicts only mel specs, and combined it with the 24 layers / 4 stacks WaveNet; the results sound like whispers.

There must be hidden tricks which are not mentioned in the paper to make it actually work. :)

@r9y9 Yes! In the Tacotron paper, they didn't mention disabling prenet dropout at inference; it cost me a lot of time to debug!

@jamestang0219 The feature format is just like what Deep Voice mentioned: the feature is a vector of 5 phonemes, 5 stresses, vuv, and norm_logF0. We upsample the features with a QRNN.

I'm trying to use a Tacotron model to predict the mel spec and use the WaveNet model to generate samples; it works reasonably.

Here is the merged model's result:
input sentence: "Peter Piper picked a peck of pickled peppers. How many pickled peppers did Peter Piper pick?" (just from tacotron2 demo :P)
Attention energy and predicted mel spec:
image
Waveform from wavenet model using predicted mel spec as local condition:
image

And for comparison, I attached the result from the Griffin-Lim vocoder:
Attention energy and predicted mel spec:
image
Linear spec:
image

Wav files from them:
samples.zip

@jamestang0219 That really sounds great! Are the results generated from the WaveNet with mixture of logistic distributions loss? Could you please show me how you upsample the mel-spectrogram features? Would that be 2 transposed convolution layers followed by a convolution layer? What about the hop size of the mel spectrogram and the sample rate of the audio?

thank you

@mfkfge No, it's from the model with cross-entropy loss over 256 classes. I'll start another experiment with the mixture of logistic distributions. I used 2 transposed convolution layers for upsampling, each with scale 16, which means the hop_size should be 256.

My experiment is on LJSpeech; hyperparameters for both Tacotron and WaveNet:

"sample_rate" : 22050,
"n_fft" : 1024,
"ref_level_db" : 20,
"min_level_db" : -100,
"pre_emphasis" : 0.97,
"silence_threshold" : 2,
"hop_size" : 256,
"frame_dim" : 80,
"reduce_factor" : 5,

For anyone interested in mixture of logistic distributions, checkout the latest code and try the following hyper parameters:

diff --git a/hparams.py b/hparams.py
index fdf6a8f..c802974 100644
--- a/hparams.py
+++ b/hparams.py
@@ -26,8 +26,8 @@ hparams = tf.contrib.training.HParams(
     # **NOTE**: if you change the one of the two parameters below, you need to
     # re-run preprocessing before training.
     # **NOTE**: scaler input (raw or mulaw) is experimental. Use it your own risk.
-    input_type="mulaw-quantize",
-    quantize_channels=256,  # 65536 or 256
+    input_type="raw",
+    quantize_channels=65536,  # 65536 or 256

Honestly, I am not completely sure if I implemented it correctly, so I regard the feature as experimental at the moment. Let me know if you find a bug. PRs are always welcome!

inner_inner_out = inner_inner_cond * \
    torch.log(torch.clamp(cdf_delta, min=1e-12)) + \
    (1. - inner_inner_cond) * (log_pdf_mid - np.log(127.5))

Do you think the value 127.5 is related to num_classes? That is to say, should it be 32767.5 when num_classes is 65536?

Well, thinking about it again, I think (1. - inner_inner_cond) * log_pdf_mid would be correct. Could anybody elaborate on where the -np.log(127.5) comes from?

@r9y9 log(p) - log((num_classes - 1) / 2) = log(p * (2 / (num_classes - 1))), and it's just the area of the rectangle under the pdf near the center of the bin.
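
A quick numeric check of that explanation (my own sketch): the bin width of the discretized distribution on [-1, 1] is 2 / (num_classes - 1), so the correction term is log((num_classes - 1) / 2).

import numpy as np

for num_classes in (256, 65536):
    bin_width = 2.0 / (num_classes - 1)
    # log(p * bin_width) = log_pdf_mid - log((num_classes - 1) / 2)
    print(num_classes, -np.log(bin_width), np.log((num_classes - 1) / 2.0))
# 256   -> 4.848  (= log(127.5))
# 65536 -> 10.397 (= log(32767.5))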

@npuichigo Thanks! I was thinking it would be log(p* δ(x)) for the extreme case, but that makes sense for discretized distributions.

https://r9y9.github.io/wavenet_vocoder/

Uploaded samples of a model trained with mixture of logistic distributions loss.

Trained a bit further and uploaded samples again. I think I got pretty good samples.

I also think those are pretty good! I'm guessing that it would be easier to train on a dataset that doesn't have reverb and the background (electronic?) buzz.

But regardless of that, I think you should be really proud of what you have achieved, because this is the best I've ever heard the LJSpeech dataset sound!

Hello, thanks for your work. I used ibab's tensorflow code and your preprocessing code to try the WaveNet vocoder. The generated LJSpeech wav is pretty good, but the ARCTIC one has a little electronic buzz in the background.
Now I am comparing your steps with mine to fix the noise problem...
The Chinese wav is 48k.

ljspeech_generate.zip
chinese.zip
cmu_arctic.zip

Very nice! The Chinese samples sound pretty good to me.

I also noticed some noise (#3) while I was working on CMU ARCTIC. I haven't investigated deeply yet, but I suspect 1-2 hours of data is not sufficient to train WaveNet.

@jamestang0219 when you decoded the mel-spectrograms produced by Tacotron with the WaveNet decoder, what loss did the Tacotron model have?

Hello @azraelkuan,
I am doing the same thing as you. I wonder how you added the local condition to ibab's tensorflow implementation. Is the shape of the local condition equal to that of the input batch?

@r9y9 I used the same preprocessing and model as yours. I found that my averaged loss per epoch kept decreasing to 0.00x while yours was stable at 2.x, but my evaluation samples are not as good as yours; that's very strange...

Did you use my latest code? I implemented the mixture of logistic distributions loss as well as exponential model averaging in #5. According to the Parallel WaveNet paper, exponential model averaging is important for quality.
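
For reference, a minimal sketch of exponential model averaging of weights (the helper names and the 0.9999 decay are illustrative assumptions, not necessarily what this repo uses; the averaged copy would typically be used only for evaluation/generation):

import copy
import torch

def clone_as_averaged_model(model):
    # keep a separate copy whose weights track an exponential moving average
    return copy.deepcopy(model)

def update_moving_average(model, averaged_model, decay=0.9999):
    with torch.no_grad():
        for p, p_avg in zip(model.parameters(), averaged_model.parameters()):
            p_avg.mul_(decay).add_(p, alpha=1.0 - decay)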

One difference would be training time. I fine-tuned the model many times, i.e., train 200k steps -> (change some hyperparameter and see how it works) -> train 200k steps (lr restarts from the initial value) -> ... repeated. This might lead to faster convergence.

If I remember correctly, I trained the model for over 1000k steps in total. The log for the final 400k steps, which I have locally, is attached for reference.

screenshot from 2018-01-29 12-12-33

Never mind the spike in the loss curve. I used a high variance lower bound (1e-4), which turned out to be very bad. I used 1e-14 for the variance lower bound after that.

log_scale_min=float(np.log(1e-14)),

My mixture of logistic distributions version has just started training; I'm comparing it with the 256-way mu-law version. Maybe something is wrong with my shuffle procedure; let me double check.

@r9y9 Would it be okay to use such a low variance lower bound (1e-14)? In my experiments, too small a variance may make the training process collapse; I mean the loss may go to NaN due to overflow.

@npuichigo I worried about that, but I have never gotten NaN so far. So I would say it's okay.

@JK1532 You can just repeat the local condition by the frame shift, or you can use several transposed convolution layers.

https://r9y9.github.io/wavenet_vocoder/

Updated samples of the multi-speaker WaveNet. Used mixture of logistic distributions. It was quite costly to train. Also added ground-truth audio samples for ease of comparison.

@r9y9 What do you mean by costly to train? What are the biggest challenges?

I meant it's very time consuming. It took a week or more to get sufficiently good quality for LJSpeech and CMU ARCTIC.

Can you share the loss curve?

I'm on a short business trip and do not have access to my GPU PC right now. I can share it when I come back home in a week.

That's great, Ryuichi! Thank you!


In the original Salimans PixelCNN++ code, the loss is converted to bits per output dimension, which is quite handy for comparison with other implementations and experiments. For this, just divide the loss by the dimensionality of the output * ln(2). How many bits is the model able to predict?


The loss is the negative log probability, and averaged over the output dimension it is an estimate of the entropy in a sample. In the original paper (predicting pixels in an image) the residual entropy was around 3 bits (out of 8, so predicting 5 bits). Since it is not easy for me to figure out the output dimension of this wavenet implementation, a loss of 56-57 doesn't tell me much.
(see https://github.com/openai/pixel-cnn/blob/master/train.py#L148)
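
As a concrete sketch of that conversion (the example loss value below is arbitrary, not one of the numbers discussed here):

import numpy as np

def nats_to_bits_per_dim(nll_nats, output_dim=1):
    # negative log-likelihood in nats, divided by (output dimensionality * ln 2)
    return nll_nats / (output_dim * np.log(2))

print(nats_to_bits_per_dim(2.3))   # ~3.32 bits per output dimension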

I see now, it's just the loss but normalized to bits, thus facilitating comparison as you mentioned!

From what I understand, the model has a mixture of 10 logistics with 3 parameters each (pi, mean, log-scale), producing a total of 30 channels.

This is what I understand from what @r9y9 has in the hparams.py file: https://github.com/r9y9/wavenet_vocoder/blob/master/hparams.py

@r9y9
I tested LJSpeech using the latest code (~1000k steps), but it's slightly noisy...
Is the latest code the same setting as the updated samples? (https://r9y9.github.io/wavenet_vocoder/)
checkpoint_step001000000.zip

Yes, current master is the latest, and it is what I have locally. Maybe the training procedure I described in #1 (comment) is important for quality.

@r9y9 would you mind re-sharing your weights for the mel-conditioned wavenet? The link you shared earlier is broken. Thanks!

@dyelax Can you check the links in #19 instead?

@r9y9 For multi-GPU training, I found that we only need to change

y_hat = model(x, c=c, g=g, softmax=False)

to

y_hat = torch.nn.parallel.data_parallel(model, (x, c, g, False))

and increase num_workers and batch_size.
Also, we can set device_ids and output_device via different cmd args.
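
A self-contained toy sketch of that substitution (ToyWaveNet below is a stand-in, not the repo's model; I'm assuming the real forward takes (x, c, g, softmax) positionally, as the snippet above suggests):

import torch
from torch import nn

class ToyWaveNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, c=None, g=None, softmax=False):
        return self.conv(x)

model = ToyWaveNet()
x = torch.randn(4, 1, 1000)
if torch.cuda.device_count() > 1:
    model, x = model.cuda(), x.cuda()
    # positional inputs map onto forward(x, c, g, softmax)
    y_hat = torch.nn.parallel.data_parallel(model, (x, None, None, False))
else:
    y_hat = model(x, c=None, g=None, softmax=False)
print(y_hat.shape)   # torch.Size([4, 1, 1000])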


Efficient Neural Audio Synthesis https://arxiv.org/abs/1802.08435
Lots of interesting tricks, and the claim is real-time synthesis on a mobile CPU thanks to weight pruning.

Hi @r9y9, Thank you so much for sharing your work.

We have followed your work and got some results in TensorFlow. While we have not tested much yet, it works with the same parameters as yours, except without the dropout and weight normalization techniques. You can find some results here. If I get more information during testing, I'll let you know. Thanks!

@twidddj Nice! I'm looking forward to your results.

I think I can close this now. Discussion on remained issues (e.g, DeepVoice + WaveNet) can continue on specific issue.