Confused about the mismatch between max_mel_frames and training sample length in hparams.py
jjoe1 opened this issue
First, thanks for this detailed model and the wiki.
I've spent the last two days reading the wiki, this code, and the original Tacotron-2 paper at https://arxiv.org/abs/1712.05884. As someone trying to learn text-to-speech models, I'm still unclear about how a fixed-length spectrogram is generated for an input text during training.
If the ground-truth clip is 14 sec, shouldn't it contain 14 * (1/0.0125) = 14 * 80 = 1120 mel frames?
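The arithmetic above can be sketched as follows, assuming the 12.5 ms frame hop from the Tacotron-2 paper (the function name is illustrative, not from the repo):

```python
def num_mel_frames(duration_sec, frame_shift_sec=0.0125):
    """Number of mel frames produced for a clip of the given duration,
    assuming one frame per 12.5 ms hop (Tacotron-2 paper default)."""
    # round() rather than int() to avoid float truncation (0.0125 is not
    # exactly representable in binary floating point)
    return round(duration_sec / frame_shift_sec)

print(num_mel_frames(14))  # 14 * 80 = 1120 frames, above max_mel_frames = 900
print(num_mel_frames(3))   # 3 * 80 = 240 frames
```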
-
Why does hparams.py have max_mel_frames = 900? And what is the max sentence length of the input text after padding? I assume all input sentences used during training and inference are padded to a max_len; is that correct?
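For what it's worth, many sequence-to-sequence implementations pad each batch only to the longest sentence in that batch rather than to a single global max length. A minimal sketch of that per-batch padding, with illustrative names (not the repo's API):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a batch of character-id sequences to the length of the
    longest sequence in the batch (not a global maximum)."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]

batch = pad_batch([[5, 2, 9], [7, 1], [4]])
print(batch)  # [[5, 2, 9], [7, 1, 0], [4, 0, 0]]
```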
-
Another related issue, which may be a beginner question: after the encoder creates 256 hidden states (from 256 bidirectional LSTM units), isn't the decoder output limited to 256 frames (for an output layer reduction factor r = 1)? If the decoder produces 1 frame per encoder state at r = 1, how can it produce more frames than there are encoder states?
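A toy sketch of why the decoder is not tied one-to-one to encoder states: an attention decoder re-reads the *same* encoder outputs at every step, so the number of decoder steps is independent of the number of encoder states. Everything below (uniform attention, the stop condition) is a deliberately simplified stand-in, not the repo's code:

```python
import numpy as np

def toy_decode(encoder_outputs, max_frames=900):
    """encoder_outputs: (T_in, D), one vector per input character.
    Runs until a stand-in stop condition fires or max_frames is hit."""
    frames = []
    for step in range(max_frames):
        # uniform attention weights, purely for illustration
        weights = np.full(len(encoder_outputs), 1.0 / len(encoder_outputs))
        context = weights @ encoder_outputs   # (D,) attention context
        frames.append(context)                # pretend this is a mel frame
        if step >= 10:                        # stand-in for a stop token
            break
    return np.stack(frames)

enc = np.random.randn(5, 4)   # 5 input characters, 4-dim encoder states
mel = toy_decode(enc)
print(mel.shape)              # (11, 4): more frames than encoder states
```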
hparams.py:
```python
#train samples of lengths between 3sec and 14sec are more than enough to make a model capable of generating consistent speech.
clip_mels_length = True, #For cases of OOM (Not really recommended, only use if facing unsolvable OOM errors, also consider clipping your samples to smaller chunks)
max_mel_frames = 900, #Only relevant when clip_mels_length = True, please only use after trying output_per_steps=3 and still getting OOM errors.
```
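As I read these comments, samples whose spectrograms exceed max_mel_frames are simply dropped from training when clip_mels_length is on, so a 14 sec clip (1120 frames at a 12.5 ms hop) would be excluded with max_mel_frames = 900. A sketch of that filtering, with illustrative names (not the repo's API):

```python
def filter_samples(samples, max_mel_frames=900, clip_mels_length=True):
    """Drop (text, mel) training pairs whose mel spectrogram is longer
    than max_mel_frames; pass everything through if clipping is off."""
    if not clip_mels_length:
        return samples
    return [(text, mel) for text, mel in samples if len(mel) <= max_mel_frames]

samples = [("short clip", [0] * 240),   # 3 sec -> 240 frames, kept
           ("long clip", [0] * 1120)]   # 14 sec -> 1120 frames, dropped
print([text for text, mel in filter_samples(samples)])  # ['short clip']
```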