quickvc / QuickVC-VoiceConversion

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion

Deviation from paper: speaker encoder LSTM layer

tarepan opened this issue

Summary

The paper says the LSTM has one layer.
The implementation uses a 3-layer LSTM.
Which is correct?

Current status

The paper says the speaker encoder sub-model consists of a one-layer LSTM structure:

The network structure of the speaker encoder contains one layer of LSTM structure and one layer of fully connected layers

However, the SpeakerEncoder implementation uses a 3-layer LSTM module:

import torch
import torch.nn as nn

class SpeakerEncoder(torch.nn.Module):
    def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
        super(SpeakerEncoder, self).__init__()
        # model_num_layers defaults to 3, so the LSTM is built with three stacked layers
        self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)

So there seems to be a deviation from the paper in the implementation.
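As a quick sanity check (a minimal sketch of mine, not code from the repository; it only assumes PyTorch), instantiating an LSTM with the repo's defaults and listing its parameter names shows three stacked layers (l0, l1, l2):

import torch.nn as nn

# Sketch: build the LSTM the way the repo's defaults do and count the
# stacked layers via PyTorch's per-layer parameter naming (weight_ih_l0, l1, ...).
lstm = nn.LSTM(80, 256, num_layers=3, batch_first=True)
print(lstm.num_layers)  # 3
print([n for n, _ in lstm.named_parameters() if n.startswith("weight_ih")])
# ['weight_ih_l0', 'weight_ih_l1', 'weight_ih_l2']  -> three stacked layers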

Question

  • For the results/demo in the paper, how many LSTM layers were used?
  • Is this a deviation, or just an interpretation of the phrase "LSTM structure"? (one layer of LSTM structure == one block of a 3-layer LSTM module?)
commented

Sorry for the inconvenience.

The open-sourced code is the correct answer.
So one layer of LSTM structure == one block of a 3-layer LSTM module.
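To make the terminology concrete, here is a minimal sketch (mine, not from the repository) of the two readings: a literal one-layer LSTM versus the single 3-layer LSTM block the code actually builds:

import torch
import torch.nn as nn

# Two readings of "one layer of LSTM structure" (illustrative only):
paper_literal = nn.LSTM(80, 256, num_layers=1, batch_first=True)  # one LSTM layer
repo_actual   = nn.LSTM(80, 256, num_layers=3, batch_first=True)  # one LSTM block with 3 stacked layers

x = torch.randn(4, 100, 80)   # (batch, frames, mel channels)
out, _ = repo_actual(x)
print(out.shape)              # torch.Size([4, 100, 256]), same shape either way

Both modules produce outputs of the same shape; the two readings differ only in depth.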

By the way, after about four months, from my current point of view, this work has a lot of disadvantages Q_Q

Thanks for the clear answer!

Even if there are some disadvantages, the open-sourced QuickVC has contributed a lot to the community!
I am looking forward to seeing improved work from you and your team in the future 😉