TTS not generating output even after 900k steps of Tacotron model
greatforu opened this issue · comments
Hi, I have used 2677 audio clips from 20 different speakers.
The Tacotron model is trained for 900k steps with a loss of 0.401, and the WaveRNN model for 1000k steps with a loss of 3.34, yet the generated wav file has no audible output. Should I train it more?
Any help would be appreciated!
Thanks
Hi there, what do the attention plots look like?
Blank attention plots (like the one you show) mean the model has failed for some reason. Maybe try finetuning on top of the pretrained model - since that has already learned attention.
Hi, this is my current attention plot. In the LSA module I have changed the sigmoid activation to softmax (scores = F.softmax(u, dim=1)),
and the reduction factor to 18 with a batch size of 32.
If I use a reduction factor of 2, the batch size cannot exceed 10; beyond that it reports insufficient memory. I'm using two Nvidia GeForce RTX 2080 GPUs with 8 GB each. Please let me know where I'm going wrong, and also how to increase the batch size at a lower reduction factor.
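One common workaround for the memory limit at r = 2 is gradient accumulation: run several small micro-batches and step the optimizer once, which approximates a larger batch without extra GPU memory. A minimal PyTorch sketch, where the model, optimizer, and accum_steps value are illustrative placeholders, not code from the WaveRNN repo:

```python
import torch

# Illustrative stand-ins; in practice these are the Tacotron model
# and its optimizer from the training script.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
accum_steps = 4  # e.g. 4 micro-batches of 8 approximate an effective batch of 32

def train_step(batches):
    """Accumulate gradients over accum_steps micro-batches, then step once."""
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = torch.nn.functional.mse_loss(model(x), y)
        (loss / accum_steps).backward()  # scale so gradients average, not sum
        if (i + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
```

The loss is divided by accum_steps so the accumulated gradient matches what a single large batch would produce on average.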
These are the hparams I'm using. The audio clips are trimmed using webrtcvad with aggressiveness 2, and the audio has a 22050 Hz sample rate and a 352 kbps bit rate.
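One thing worth double-checking in that pipeline: webrtcvad only accepts 16-bit mono PCM at 8, 16, 32, or 48 kHz, in 10/20/30 ms frames, so 22050 Hz audio must be resampled (e.g. to 16 kHz) before the VAD decision. A minimal stdlib sketch of the frame slicing; frame_generator is a hypothetical helper, and the actual vad.is_speech(frame, rate) call is omitted:

```python
def frame_generator(pcm: bytes, sample_rate: int, frame_ms: int = 30):
    """Yield fixed-size 16-bit mono PCM frames suitable for webrtcvad."""
    # webrtcvad rejects other rates, including 22050 Hz
    assert sample_rate in (8000, 16000, 32000, 48000), "rate unsupported by webrtcvad"
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]

# 1 s of silence at 16 kHz yields complete 30 ms frames of 960 bytes each
frames = list(frame_generator(b"\x00" * 32000, 16000))
```

If the VAD was fed 22050 Hz frames directly, the trimming may have been silently wrong, which would also corrupt the training data.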
Settings for all models
sample_rate = 22050
n_fft = 2048
fft_bins = n_fft // 2 + 1
num_mels = 80
hop_length = 275 # 12.5ms - in line with Tacotron 2 paper
win_length = 1100 # 50ms - same reason as above
fmin = 40
min_level_db = -100
ref_level_db = 20
bits = 8 # bit depth of signal
mu_law = False # Recommended to suppress noise if using raw bits in hp.voc_mode below
peak_norm = True # Normalise to the peak of each wav file
voc_mode = 'RAW' # either 'RAW' (softmax on raw bits) or 'MOL' (sample from mixture of logistics)
voc_upsample_factors = (5, 5, 11) # NB - this needs to correctly factorise hop_length
voc_rnn_dims = 512
voc_fc_dims = 512
voc_compute_dims = 128
voc_res_out_dims = 128
voc_res_blocks = 10
voc_batch_size = 80
voc_lr = 1e-4
voc_checkpoint_every = 25_000
voc_gen_at_checkpoint = 5 # number of samples to generate at each checkpoint
voc_total_steps = 1_200_000 # Total number of training steps
voc_test_samples = 50 # How many unseen samples to put aside for testing
voc_pad = 2 # this will pad the input so that the resnet can 'see' wider than input length
voc_seq_len = hop_length * 5 # must be a multiple of hop_length
voc_clip_grad_norm = 3 # set to None if no gradient clipping needed
Generating / Synthesizing
voc_gen_batched = True # very fast (realtime+) single utterance batched generation
voc_target = 11_000 # target number of samples to be generated in each batch entry
voc_overlap = 550 # number of samples for crossfading between batches
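Since voc_upsample_factors must factorise hop_length exactly (as the comment above notes), a quick sanity check avoids silent frame/sample misalignment whenever either value is changed. A pure-Python sketch, where check_upsample is a hypothetical helper:

```python
from functools import reduce
from operator import mul

hop_length = 275
voc_upsample_factors = (5, 5, 11)

def check_upsample(factors, hop):
    """Raise if the upsample factors do not multiply out to hop_length."""
    prod = reduce(mul, factors, 1)
    if prod != hop:
        raise ValueError(f"upsample factors multiply to {prod}, expected hop_length {hop}")
    return prod

check_upsample(voc_upsample_factors, hop_length)  # 5 * 5 * 11 = 275
```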
Model Hparams
tts_embed_dims = 256 # embedding dimension for the graphemes/phoneme inputs
tts_encoder_dims = 128
tts_decoder_dims = 256
tts_postnet_dims = 128
tts_encoder_K = 16
tts_lstm_dims = 512
tts_postnet_K = 8
tts_num_highways = 4
tts_dropout = 0.5
tts_cleaner_names = ['english_cleaners']
tts_stop_threshold = -3.4 # Value below which audio generation ends. For example, for a range of [-4, 4], this will terminate the sequence at the first frame that has all values < -3.4
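The stop rule described in that comment can be sketched as a one-line check (illustrative, not the repo's actual implementation):

```python
tts_stop_threshold = -3.4

def should_stop(frame):
    """True at the first frame whose values are all below the threshold."""
    return all(v < tts_stop_threshold for v in frame)
```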
Training
tts_schedule = [(7, 1e-3, 10_000, 16), # progressive training schedule
(5, 1e-4, 100_000, 16), # (r, lr, step, batch_size)
(2, 1e-4, 180_000, 8),
                (18, 5e-5, 1_500_000, 32)]
tts_max_mel_len = 1250 # if you have a couple of extremely long spectrograms you might want to use this
tts_bin_lengths = True # bins the spectrogram lengths before sampling in data loader - speeds up training
tts_clip_grad_norm = 1.0 # clips the gradient norm to prevent explosion - set to None if not needed
tts_checkpoint_every = 10_000 # checkpoints the model every X steps
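For reference, a progressive schedule like the one above is typically consumed by picking the first entry whose step bound exceeds the current step, so the final entry's r is what the model trains with long-term. A minimal sketch, where schedule_for is a hypothetical helper; note the last entry here keeps r = 2 rather than raising it to 18 (an assumption about the intended setup, since attention usually degrades at large reduction factors):

```python
# (r, lr, max_step, batch_size) -- same layout as tts_schedule above
schedule = [(7, 1e-3, 10_000, 16),
            (5, 1e-4, 100_000, 16),
            (2, 1e-4, 180_000, 8),
            (2, 5e-5, 1_500_000, 32)]  # assumption: r stays at 2

def schedule_for(step, schedule):
    """Return (r, lr, batch_size) for the current training step."""
    for r, lr, max_step, batch_size in schedule:
        if step < max_step:
            return r, lr, batch_size
    r, lr, _, batch_size = schedule[-1]
    return r, lr, batch_size
```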