quickvc / QuickVC-VoiceConversion

QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion

Deviation from paper: speaker encoder LSTM layer

tarepan opened this issue

Summary

The paper says the LSTM has one layer.
The implementation uses a 3-layer LSTM.
Which is correct?

Current status

The paper says the speaker encoder sub-model consists of a one-layer LSTM structure:

The network structure of the speaker encoder contains one layer of LSTM structure and one layer of fully connected layers

However, the SpeakerEncoder implementation uses a 3-layer LSTM module:

import torch
import torch.nn as nn

class SpeakerEncoder(torch.nn.Module):
    def __init__(self, mel_n_channels=80, model_num_layers=3, model_hidden_size=256, model_embedding_size=256):
        super(SpeakerEncoder, self).__init__()
        # model_num_layers defaults to 3, so the LSTM is built with three stacked layers
        self.lstm = nn.LSTM(mel_n_channels, model_hidden_size, model_num_layers, batch_first=True)

So there seems to be a deviation from the paper in the implementation.
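As a quick sanity check (a minimal sketch of mine, not code from the repository; it only assumes PyTorch), instantiating an LSTM with the repo's defaults and listing its parameter names shows three stacked layers (l0, l1, l2):

import torch.nn as nn

# Sketch: build the LSTM the way the repo's defaults do and count the
# stacked layers via PyTorch's per-layer parameter naming (weight_ih_l0, l1, ...).
lstm = nn.LSTM(80, 256, num_layers=3, batch_first=True)
print(lstm.num_layers)  # 3
print([n for n, _ in lstm.named_parameters() if n.startswith("weight_ih")])
# ['weight_ih_l0', 'weight_ih_l1', 'weight_ih_l2']  -> three stacked layers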

Question

  • For the results/demo in the paper, how many LSTM layers were used?
  • Is this a deviation, or just an interpretation of the phrase "LSTM structure"? (one layer of LSTM structure == one block of a 3-layer LSTM module?)
commented

Sorry for the inconvenience.

The open-sourced code is the correct answer.
So one layer of LSTM structure == one block of a 3-layer LSTM module.
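To make the terminology concrete, here is a minimal sketch (mine, not from the repository) of the two readings: a literal one-layer LSTM versus the single 3-layer LSTM block the code actually builds:

import torch
import torch.nn as nn

# Two readings of "one layer of LSTM structure" (illustrative only):
paper_literal = nn.LSTM(80, 256, num_layers=1, batch_first=True)  # one LSTM layer
repo_actual   = nn.LSTM(80, 256, num_layers=3, batch_first=True)  # one LSTM block with 3 stacked layers

x = torch.randn(4, 100, 80)   # (batch, frames, mel channels)
out, _ = repo_actual(x)
print(out.shape)              # torch.Size([4, 100, 256]), same shape either way

Both modules produce outputs of the same shape; the two readings differ only in depth.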

By the way, after about four months, from my current point of view, this work has a lot of disadvantages Q_Q

Thanks for the clear answer!

Even if there are some disadvantages, the open-sourced QuickVC has contributed a lot to the community!
I am looking forward to seeing improved work from you and your team in the future 😉