fatchord / WaveRNN

WaveRNN Vocoder + TTS

Home Page: https://fatchord.github.io/model_outputs/

Training WaveRNN with a different data set

ivancarapinha opened this issue · comments

Hello, I am interested in WaveRNN as a vocoder (not TTS). I am training a model on a data set different from LJSpeech, and the loss seems to have stabilized around a high value (around 6) after 450k training steps. The .wav files in my data set have a 16 kHz sampling rate instead of 22.05 kHz. My question is: what considerations should I take into account when training WaveRNN with a different data set? Should I only change the sampling rate in the hparams.py file, or adjust other values as well? Thank you very much.

Hi, if your dataset is 16 kHz it will be automatically upsampled to 22 kHz if you leave the hparams as they are. Obviously that's a bit of a waste, since staying at 16 kHz would give you faster inference.
So if you do want to change the hparams to 16 kHz, you might want to adjust the hop length / window length / upsample factors and n_fft to suit your needs.

Also, if you are new to this model you can train the 8-bit RAW mode really fast just to verify everything is working well. MOL mode takes quite a while to converge (around 800k+ steps).
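For 16 kHz data, one way to pick consistent values (a rough sketch, not the only valid choice) is to derive the hop and window lengths from the same 12.5 ms / 50 ms frame timings and make sure the upsample factors multiply to the hop length:

# Sketch: deriving 16 kHz DSP settings from the 12.5 ms / 50 ms frame timings
import math

sample_rate = 16000
hop_length = int(sample_rate * 0.0125)     # 200 samples = 12.5 ms
win_length = int(sample_rate * 0.050)      # 800 samples = 50 ms
n_fft = 1024                               # >= win_length, power of two

voc_upsample_factors = (5, 5, 8)           # must factorise hop_length
assert math.prod(voc_upsample_factors) == hop_length   # 5 * 5 * 8 == 200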

Thank you for the advice!

Hello again,

I trained WaveRNN in 8-bit RAW mode without mu-law for 250k steps, which should apparently be enough, but the samples generated during training sound too noisy. The hparams I used are specified below. I have two questions regarding this problem:

    • Should I adjust any DSP/vocoder hyperparameters, such as n_fft (I haven't yet tried n_fft = win_length) or voc_upsample_factors (e.g. voc_upsample_factors = (10, 20)), or should I keep training the model but with a lower learning rate?
    • Does the size of the data set influence the training process in terms of quality? I ask this because I am using a ~14-hour data set with 7300 utterances, which is a little more than half the size of the LJSpeech data set.

Thank you.

# DSP -----------------------------------
# Settings for all models
sample_rate = 16000
n_fft = 1024
fft_bins = n_fft // 2 + 1
num_mels = 80
hop_length = 200                    # 12.5ms - in line with Tacotron 2 paper
win_length = 800                    # 50ms - same reason as above
fmin = 40
min_level_db = -100
ref_level_db = 20
bits = 8                           
mu_law = False                
peak_norm = False                 

# WAVERNN / VOCODER --------------------------------------------------
# Model Hparams
voc_mode = 'RAW'                    # either 'RAW' (softmax on raw bits) or 'MOL' (sample from mixture of logistics)
voc_upsample_factors = (5, 5, 8)   # NB - this needs to correctly factorise hop_length
voc_rnn_dims = 512
voc_fc_dims = 512
voc_compute_dims = 128
voc_res_out_dims = 128
voc_res_blocks = 10

# Training
voc_batch_size = 32
voc_lr = 1e-4
voc_checkpoint_every = 25_000
voc_gen_at_checkpoint = 5           
voc_total_steps = 1_000_000        
voc_test_samples = 50               
voc_pad = 2                         
voc_seq_len = hop_length * 5        
voc_clip_grad_norm = 4              
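For context on what 8-bit RAW mode actually predicts: the target waveform is quantised into 2**bits classes for the softmax. A minimal illustration of that encoding (a sketch of the idea, not necessarily the exact code in this repo):

import numpy as np

bits = 8

def float_to_label(x, bits):
    # map samples in [-1, 1] to integer classes 0 .. 2**bits - 1
    return np.clip((x + 1.0) * (2 ** bits - 1) / 2, 0, 2 ** bits - 1).astype(np.int64)

def label_to_float(y, bits):
    # inverse mapping back to [-1, 1] (only quantisation error is lost)
    return 2.0 * y / (2 ** bits - 1) - 1.0

wav = np.random.uniform(-1, 1, 16000).astype(np.float32)   # stand-in for a 1-second clip
labels = float_to_label(wav, bits)                          # softmax targets in RAW mode
recon = label_to_float(labels, bits)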

Could you please tell me how you adapted this model to 16 kHz data?
Thank you.

In my case, I realized that adjusting the DSP parameters that @fatchord recommended in the comment above and using peak_norm = True was enough.

I first tried to train the model in RAW mode, with 8-bit/9-bit configurations, but the results were poor. So I finally tried training the model in MOL mode, and surprisingly I obtained good results very fast, in about 200k steps. I saw immediate improvements in the naturalness of the generated speech (at about 50k-75k steps). Then, as training progressed, I noticed that the naturalness kept improving but the intelligibility severely decreased. So I decreased the learning rate (from 1e-4 to 5e-5), just as @fatchord recommended in other issues, and the intelligibility drastically improved again. I then waited for this phenomenon (improving naturalness, deteriorating intelligibility) to happen again and repeated the process. Eventually the results stopped changing and I assumed the model had converged.
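A minimal sketch of that kind of manual schedule (the step thresholds here are hypothetical, and the optimizer update is PyTorch-style):

# Hypothetical manual LR schedule: drop the learning rate when intelligibility degrades.
lr_schedule = {0: 1e-4, 200_000: 5e-5, 400_000: 2.5e-5}   # illustrative boundaries only

def lr_for_step(step):
    lr = lr_schedule[0]
    for boundary, value in sorted(lr_schedule.items()):
        if step >= boundary:
            lr = value
    return lr

# inside the training loop (assuming a torch.optim optimizer):
# for group in optimizer.param_groups:
#     group['lr'] = lr_for_step(step)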

I used the following hyperparameters:

# DSP
sample_rate = 16000
n_fft = 1024                        
fft_bins = n_fft // 2 + 1
num_mels = 80
hop_length = 200                   
win_length = 800                  
fmin = 40
min_level_db = -100
ref_level_db = 20
bits = 9                            
mu_law = True                       
peak_norm = True

# WAVERNN / VOCODER -----

# Model Hparams
voc_mode = 'MOL'                    # either 'RAW' (softmax on raw bits) or 'MOL' (sample from mixture of logistics)
voc_upsample_factors = (5, 5, 8)   # this needs to correctly factorise hop_length
voc_rnn_dims = 512
voc_fc_dims = 512
voc_compute_dims = 128
voc_res_out_dims = 128
voc_res_blocks = 10

# Training
voc_batch_size = 32
voc_lr = 1e-4
voc_checkpoint_every = 25_000
voc_gen_at_checkpoint = 5           # number of samples to generate at each checkpoint
voc_total_steps = 1_000_000         # Total number of training steps
voc_test_samples = 50               # How many unseen samples to put aside for testing
voc_pad = 2                         # this will pad the input so that the resnet can 'see' wider than input length
voc_seq_len = hop_length * 5        # must be a multiple of hop_length
voc_clip_grad_norm = 4              # set to None if no gradient clipping needed

# Generating / Synthesizing
voc_gen_batched = True              # very fast (realtime+) single utterance batched generation
voc_target = 11_000                 # target number of samples to be generated in each batch entry
voc_overlap = 550                   # number of samples for crossfading between batches
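To illustrate what voc_target and voc_overlap control in batched generation, here is a simplified numpy sketch of the fold/crossfade idea (not the repo's exact implementation):

import numpy as np

voc_target = 11_000    # samples generated per batch entry
voc_overlap = 550      # samples shared between neighbouring entries for crossfading

def crossfade_segments(segments, overlap):
    # Stitch consecutive segments with a linear crossfade over `overlap` samples.
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    out = segments[0].copy()
    for seg in segments[1:]:
        out[-overlap:] = out[-overlap:] * fade_out + seg[:overlap] * fade_in
        out = np.concatenate([out, seg[overlap:]])
    return out

# e.g. three generated chunks of length voc_target + voc_overlap stitched back together
chunks = [np.random.randn(voc_target + voc_overlap) for _ in range(3)]
wav = crossfade_segments(chunks, voc_overlap)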

Thank you, it really helps me.

What does peak_norm do? I know it normalizes the peak of each file, but where is it implemented? I couldn't find it in any file. @fatchord
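(For reference, peak normalization in this context usually just rescales each file during preprocessing so its loudest sample hits a fixed level; a generic illustration of the idea, not necessarily where or how this repo implements it:)

import numpy as np

def peak_normalize(wav, peak=0.999):
    # rescale so the maximum absolute sample equals `peak`
    return wav / max(np.abs(wav).max(), 1e-8) * peak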