ming024 / FastSpeech2

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

fine-tuning issue

zhoufqing opened this issue

When I use LJSpeech's 900000.pth.tar as the pre-trained model and fine-tune on my own data, I load it with model.load_state_dict(torch.load('./output/ckpt/LJSpeech/900000.pth.tar')), but the following error occurs in the subsequent code:
RuntimeError: Error(s) in loading state_dict for FastSpeech2:
Missing key(s) in state_dict: "encoder.position_enc", "encoder.src_word_emb.weight", "encoder.layer_stack.0.slf_attn.w_qs.weight", "encoder.layer_stack.0.slf_attn.w_qs.bias", "encoder.layer_stack.0.slf_attn.w_ks.weight", "encoder.layer_stack.0.slf_attn.w_ks.bias", "encoder.layer_stack.0.slf_attn.w_vs.weight", "encoder.layer_stack.0.slf_attn.w_vs.bias", "encoder.layer_stack.0.slf_attn.layer_norm.weight", "encoder.layer_stack.0.slf_attn.layer_norm.bias", "encoder.layer_stack.0.slf_attn.fc.weight", "encoder.layer_stack.0.slf_attn.fc.bias", "encoder.layer_stack.0.pos_ffn.w_1.weight", "encoder.layer_stack.0.pos_ffn.w_1.bias", "encoder.layer_stack.0.pos_ffn.w_2.weight", "encoder.layer_stack.0.pos_ffn.w_2.bias", "encoder.layer_stack.0.pos_ffn.layer_norm.weight", "encoder.layer_stack.0.pos_ffn.layer_norm.bias", "encoder.layer_stack.1.slf_attn.w_qs.weight", "encoder.layer_stack.1.slf_attn.w_qs.bias", "encoder.layer_stack.1.slf_attn.w_ks.weight", "encoder.layer_stack.1.slf_attn.w_ks.bias", "encoder.layer_stack.1.slf_attn.w_vs.weight", "encoder.layer_stack.1.slf_attn.w_vs.bias", "encoder.layer_stack.1.slf_attn.layer_norm.weight", "encoder.layer_stack.1.slf_attn.layer_norm.bias", "encoder.layer_stack.1.slf_attn.fc.weight", "encoder.layer_stack.1.slf_attn.fc.bias", "encoder.layer_stack.1.pos_ffn.w_1.weight", "encoder.layer_stack.1.pos_ffn.w_1.bias", "encoder.layer_stack.1.pos_ffn.w_2.weight", "encoder.layer_stack.1.pos_ffn.w_2.bias", "encoder.layer_stack.1.pos_ffn.layer_norm.weight", "encoder.layer_stack.1.pos_ffn.layer_norm.bias", "encoder.layer_stack.2.slf_attn.w_qs.weight", "encoder.layer_stack.2.slf_attn.w_qs.bias", "encoder.layer_stack.2.slf_attn.w_ks.weight", "encoder.layer_stack.2.slf_attn.w_ks.bias", "encoder.layer_stack.2.slf_attn.w_vs.weight", "encoder.layer_stack.2.slf_attn.w_vs.bias", "encoder.layer_stack.2.slf_attn.layer_norm.weight", "encoder.layer_stack.2.slf_attn.layer_norm.bias", "encoder.layer_stack.2.slf_attn.fc.weight", "encoder.layer_stack.2.slf_attn.fc.bias", "encoder.layer_stack.2.pos_ffn.w_1.weight", "encoder.layer_stack.2.pos_ffn.w_1.bias", "encoder.layer_stack.2.pos_ffn.w_2.weight", "encoder.layer_stack.2.pos_ffn.w_2.bias", "encoder.layer_stack.2.pos_ffn.layer_norm.weight", "encoder.layer_stack.2.pos_ffn.layer_norm.bias", "encoder.layer_stack.3.slf_attn.w_qs.weight", "encoder.layer_stack.3.slf_attn.w_qs.bias", "encoder.layer_stack.3.slf_attn.w_ks.weight", "encoder.layer_stack.3.slf_attn.w_ks.bias", "encoder.layer_stack.3.slf_attn.w_vs.weight", "encoder.layer_stack.3.slf_attn.w_vs.bias", "encoder.layer_stack.3.slf_attn.layer_norm.weight", "encoder.layer_stack.3.slf_attn.layer_norm.bias", "encoder.layer_stack.3.slf_attn.fc.weight", "encoder.layer_stack.3.slf_attn.fc.bias", "encoder.layer_stack.3.pos_ffn.w_1.weight", "encoder.layer_stack.3.pos_ffn.w_1.bias", "encoder.layer_stack.3.pos_ffn.w_2.weight", "encoder.layer_stack.3.pos_ffn.w_2.bias", "encoder.layer_stack.3.pos_ffn.layer_norm.weight", "encoder.layer_stack.3.pos_ffn.layer_norm.bias", "variance_adaptor.pitch_bins", "variance_adaptor.energy_bins", "variance_adaptor.duration_predictor.conv_layer.conv1d_1.conv.weight", "variance_adaptor.duration_predictor.conv_layer.conv1d_1.conv.bias", "variance_adaptor.duration_predictor.conv_layer.layer_norm_1.weight", "variance_adaptor.duration_predictor.conv_layer.layer_norm_1.bias", "variance_adaptor.duration_predictor.conv_layer.conv1d_2.conv.weight", "variance_adaptor.duration_predictor.conv_layer.conv1d_2.conv.bias", 
"variance_adaptor.duration_predictor.conv_layer.layer_norm_2.weight", "variance_adaptor.duration_predictor.conv_layer.layer_norm_2.bias", "variance_adaptor.duration_predictor.linear_layer.weight", "variance_adaptor.duration_predictor.linear_layer.bias", "variance_adaptor.pitch_predictor.conv_layer.conv1d_1.conv.weight", "variance_adaptor.pitch_predictor.conv_layer.conv1d_1.conv.bias", "variance_adaptor.pitch_predictor.conv_layer.layer_norm_1.weight", "variance_adaptor.pitch_predictor.conv_layer.layer_norm_1.bias", "variance_adaptor.pitch_predictor.conv_layer.conv1d_2.conv.weight", "variance_adaptor.pitch_predictor.conv_layer.conv1d_2.conv.bias", "variance_adaptor.pitch_predictor.conv_layer.layer_norm_2.weight", "variance_adaptor.pitch_predictor.conv_layer.layer_norm_2.bias", "variance_adaptor.pitch_predictor.linear_layer.weight", "variance_adaptor.pitch_predictor.linear_layer.bias", "variance_adaptor.energy_predictor.conv_layer.conv1d_1.conv.weight", "variance_adaptor.energy_predictor.conv_layer.conv1d_1.conv.bias", "variance_adaptor.energy_predictor.conv_layer.layer_norm_1.weight", "variance_adaptor.energy_predictor.conv_layer.layer_norm_1.bias", "variance_adaptor.energy_predictor.conv_layer.conv1d_2.conv.weight", "variance_adaptor.energy_predictor.conv_layer.conv1d_2.conv.bias", "variance_adaptor.energy_predictor.conv_layer.layer_norm_2.weight", "variance_adaptor.energy_predictor.conv_layer.layer_norm_2.bias", "variance_adaptor.energy_predictor.linear_layer.weight", "variance_adaptor.energy_predictor.linear_layer.bias", "variance_adaptor.pitch_embedding.weight", "variance_adaptor.energy_embedding.weight", "decoder.position_enc", "decoder.layer_stack.0.slf_attn.w_qs.weight", "decoder.layer_stack.0.slf_attn.w_qs.bias", "decoder.layer_stack.0.slf_attn.w_ks.weight", "decoder.layer_stack.0.slf_attn.w_ks.bias", "decoder.layer_stack.0.slf_attn.w_vs.weight", "decoder.layer_stack.0.slf_attn.w_vs.bias", "decoder.layer_stack.0.slf_attn.layer_norm.weight", "decoder.layer_stack.0.slf_attn.layer_norm.bias", "decoder.layer_stack.0.slf_attn.fc.weight", "decoder.layer_stack.0.slf_attn.fc.bias", "decoder.layer_stack.0.pos_ffn.w_1.weight", "decoder.layer_stack.0.pos_ffn.w_1.bias", "decoder.layer_stack.0.pos_ffn.w_2.weight", "decoder.layer_stack.0.pos_ffn.w_2.bias", "decoder.layer_stack.0.pos_ffn.layer_norm.weight", "decoder.layer_stack.0.pos_ffn.layer_norm.bias", "decoder.layer_stack.1.slf_attn.w_qs.weight", "decoder.layer_stack.1.slf_attn.w_qs.bias", "decoder.layer_stack.1.slf_attn.w_ks.weight", "decoder.layer_stack.1.slf_attn.w_ks.bias", "decoder.layer_stack.1.slf_attn.w_vs.weight", "decoder.layer_stack.1.slf_attn.w_vs.bias", "decoder.layer_stack.1.slf_attn.layer_norm.weight", "decoder.layer_stack.1.slf_attn.layer_norm.bias", "decoder.layer_stack.1.slf_attn.fc.weight", "decoder.layer_stack.1.slf_attn.fc.bias", "decoder.layer_stack.1.pos_ffn.w_1.weight", "decoder.layer_stack.1.pos_ffn.w_1.bias", "decoder.layer_stack.1.pos_ffn.w_2.weight", "decoder.layer_stack.1.pos_ffn.w_2.bias", "decoder.layer_stack.1.pos_ffn.layer_norm.weight", "decoder.layer_stack.1.pos_ffn.layer_norm.bias", "decoder.layer_stack.2.slf_attn.w_qs.weight", "decoder.layer_stack.2.slf_attn.w_qs.bias", "decoder.layer_stack.2.slf_attn.w_ks.weight", "decoder.layer_stack.2.slf_attn.w_ks.bias", "decoder.layer_stack.2.slf_attn.w_vs.weight", "decoder.layer_stack.2.slf_attn.w_vs.bias", "decoder.layer_stack.2.slf_attn.layer_norm.weight", "decoder.layer_stack.2.slf_attn.layer_norm.bias", "decoder.layer_stack.2.slf_attn.fc.weight", 
"decoder.layer_stack.2.slf_attn.fc.bias", "decoder.layer_stack.2.pos_ffn.w_1.weight", "decoder.layer_stack.2.pos_ffn.w_1.bias", "decoder.layer_stack.2.pos_ffn.w_2.weight", "decoder.layer_stack.2.pos_ffn.w_2.bias", "decoder.layer_stack.2.pos_ffn.layer_norm.weight", "decoder.layer_stack.2.pos_ffn.layer_norm.bias", "decoder.layer_stack.3.slf_attn.w_qs.weight", "decoder.layer_stack.3.slf_attn.w_qs.bias", "decoder.layer_stack.3.slf_attn.w_ks.weight", "decoder.layer_stack.3.slf_attn.w_ks.bias", "decoder.layer_stack.3.slf_attn.w_vs.weight", "decoder.layer_stack.3.slf_attn.w_vs.bias", "decoder.layer_stack.3.slf_attn.layer_norm.weight", "decoder.layer_stack.3.slf_attn.layer_norm.bias", "decoder.layer_stack.3.slf_attn.fc.weight", "decoder.layer_stack.3.slf_attn.fc.bias", "decoder.layer_stack.3.pos_ffn.w_1.weight", "decoder.layer_stack.3.pos_ffn.w_1.bias", "decoder.layer_stack.3.pos_ffn.w_2.weight", "decoder.layer_stack.3.pos_ffn.w_2.bias", "decoder.layer_stack.3.pos_ffn.layer_norm.weight", "decoder.layer_stack.3.pos_ffn.layer_norm.bias", "decoder.layer_stack.4.slf_attn.w_qs.weight", "decoder.layer_stack.4.slf_attn.w_qs.bias", "decoder.layer_stack.4.slf_attn.w_ks.weight", "decoder.layer_stack.4.slf_attn.w_ks.bias", "decoder.layer_stack.4.slf_attn.w_vs.weight", "decoder.layer_stack.4.slf_attn.w_vs.bias", "decoder.layer_stack.4.slf_attn.layer_norm.weight", "decoder.layer_stack.4.slf_attn.layer_norm.bias", "decoder.layer_stack.4.slf_attn.fc.weight", "decoder.layer_stack.4.slf_attn.fc.bias", "decoder.layer_stack.4.pos_ffn.w_1.weight", "decoder.layer_stack.4.pos_ffn.w_1.bias", "decoder.layer_stack.4.pos_ffn.w_2.weight", "decoder.layer_stack.4.pos_ffn.w_2.bias", "decoder.layer_stack.4.pos_ffn.layer_norm.weight", "decoder.layer_stack.4.pos_ffn.layer_norm.bias", "decoder.layer_stack.5.slf_attn.w_qs.weight", "decoder.layer_stack.5.slf_attn.w_qs.bias", "decoder.layer_stack.5.slf_attn.w_ks.weight", "decoder.layer_stack.5.slf_attn.w_ks.bias", "decoder.layer_stack.5.slf_attn.w_vs.weight", "decoder.layer_stack.5.slf_attn.w_vs.bias", "decoder.layer_stack.5.slf_attn.layer_norm.weight", "decoder.layer_stack.5.slf_attn.layer_norm.bias", "decoder.layer_stack.5.slf_attn.fc.weight", "decoder.layer_stack.5.slf_attn.fc.bias", "decoder.layer_stack.5.pos_ffn.w_1.weight", "decoder.layer_stack.5.pos_ffn.w_1.bias", "decoder.layer_stack.5.pos_ffn.w_2.weight", "decoder.layer_stack.5.pos_ffn.w_2.bias", "decoder.layer_stack.5.pos_ffn.layer_norm.weight", "decoder.layer_stack.5.pos_ffn.layer_norm.bias", "mel_linear.weight", "mel_linear.bias", "postnet.convolutions.0.0.conv.weight", "postnet.convolutions.0.0.conv.bias", "postnet.convolutions.0.1.weight", "postnet.convolutions.0.1.bias", "postnet.convolutions.0.1.running_mean", "postnet.convolutions.0.1.running_var", "postnet.convolutions.1.0.conv.weight", "postnet.convolutions.1.0.conv.bias", "postnet.convolutions.1.1.weight", "postnet.convolutions.1.1.bias", "postnet.convolutions.1.1.running_mean", "postnet.convolutions.1.1.running_var", "postnet.convolutions.2.0.conv.weight", "postnet.convolutions.2.0.conv.bias", "postnet.convolutions.2.1.weight", "postnet.convolutions.2.1.bias", "postnet.convolutions.2.1.running_mean", "postnet.convolutions.2.1.running_var", "postnet.convolutions.3.0.conv.weight", "postnet.convolutions.3.0.conv.bias", "postnet.convolutions.3.1.weight", "postnet.convolutions.3.1.bias", "postnet.convolutions.3.1.running_mean", "postnet.convolutions.3.1.running_var", "postnet.convolutions.4.0.conv.weight", "postnet.convolutions.4.0.conv.bias", 
"postnet.convolutions.4.1.weight", "postnet.convolutions.4.1.bias", "postnet.convolutions.4.1.running_mean", "postnet.convolutions.4.1.running_var".
Unexpected key(s) in state_dict: "model", "optimizer".

You are facing this issue because two states are saved in this checkpoint: the model and the optimizer. To fine-tune, you need to load only the model component, and it should work fine. As an experiment, you can load 900000.pth.tar in a Jupyter notebook and inspect its contents; that will give you more clarity.
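For reference, a minimal sketch of that fix (assuming model is your instantiated FastSpeech2 module; the "model" and "optimizer" keys are exactly the unexpected keys reported in the error above):

import torch

# The .pth.tar checkpoint is a plain dict with two entries, "model" and
# "optimizer", each holding its own state_dict.
ckpt = torch.load("./output/ckpt/LJSpeech/900000.pth.tar", map_location="cpu")
print(ckpt.keys())  # dict_keys(['model', 'optimizer'])

# Load only the model weights for fine-tuning; the optimizer state is
# only needed if you want to resume the original training run exactly.
model.load_state_dict(ckpt["model"])

Passing the whole checkpoint dict to model.load_state_dict() is what produced the "Missing key(s)" / "Unexpected key(s)" error, since the top-level keys ("model", "optimizer") don't match any of the model's parameter names.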