huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Not able to generate audio of as good quality with LibriTTS as with LJSpeech

Hertin opened this issue · comments

Hi, thank you for the great work and for releasing the pretrained model. I tried to train the Grad-TTS model on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This is even true when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

@Hertin Hi! We provided the multi-speaker setting and checkpoint in this code for basic review only and did not aim for competitive results. To close the quality gap in the multi-speaker scenario, I would suggest increasing the number of channels in the model and using a pre-trained speaker encoder instead of learnable speaker embeddings. In our own experiments, that helped to get much better results.
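For what it's worth, here is a minimal sketch of the second suggestion (my own illustration, not the authors' code): extract fixed d-vectors with a pre-trained speaker encoder such as Resemblyzer and project them to the `spk_emb_dim` the model expects, instead of looking up a learnable `torch.nn.Embedding`. The projection layer and variable names are assumptions.

```python
# Hedged sketch: precompute fixed speaker embeddings (d-vectors) with a
# pre-trained encoder and feed them to Grad-TTS instead of a learnable
# nn.Embedding lookup. The Linear projection and names below are my own
# illustration, not part of the released code.
import torch
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

voice_encoder = VoiceEncoder()  # outputs 256-dim utterance embeddings

def extract_dvector(wav_path: str) -> torch.Tensor:
    """Return a fixed 256-dim speaker embedding for one reference utterance."""
    wav = preprocess_wav(wav_path)
    return torch.from_numpy(voice_encoder.embed_utterance(wav))

# Project the frozen d-vector into the dimensionality the model expects,
# then pass it wherever the embedding lookup is currently consumed.
spk_emb_dim = 64                      # speaker embedding size from params.py
spk_proj = torch.nn.Linear(256, spk_emb_dim)

dvec = extract_dvector("reference.wav")   # (256,)
spk_emb = spk_proj(dvec.unsqueeze(0))     # (1, spk_emb_dim)
```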

Thank you for the suggestions. I will try to increase the number of channels. Are you able to share the configuration (e.g. n_enc_channels, filter_channels, etc.) you used to get better results? I understand if you can't or don't want to.

@Hertin try [256, 512, 1024] channels in the UNet. That is dim=256 (dec_dim in params.py) and dim_mults=[1, 2, 4] in the UNet constructor.
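For reference, this is roughly what the change looks like, assuming the stock Grad-TTS layout where params.py defines dec_dim and the UNet score estimator takes dim and dim_mults (a sketch, not a verified diff):

```python
# Sketch of the suggested change under the stock Grad-TTS layout.

# params.py
dec_dim = 256            # base UNet width; the released config uses 64

# UNet construction (schematically): with dim_mults = [1, 2, 4] the three
# resolution stages then get 256, 512 and 1024 channels respectively.
dim_mults = [1, 2, 4]
channels = [dec_dim * m for m in dim_mults]   # -> [256, 512, 1024]
```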

Thank you very much. I will try this setting.

Hi, thanks for the great work and the insights you gave.
I'm also interested in setting up your model for the multi-speaker task.
Following your advice to increase the number of channels, the number of parameters obviously rocketed, as did the memory usage.
In fact I can barely go above dec_dim=72 before it saturates GPU memory, with an extra 4 GB used compared to dec_dim=64.
Have you been able to train it on a larger GPU? Does it scale well? Any other advice for the multi-speaker setting?

No. I gave up. Although I was able to train the model, the generated audio does not sound as good as in the single-speaker setting.

@theodorblackbird @Hertin check out a smarter way to design multi-speaker Grad-TTS here. They use Adaptive Layer Normalization and an additional transformer style encoder after alignment. It should be feasible in terms of memory usage while keeping the diffusion part itself small enough. Its performance is at SOTA level as far as I know.
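For context, Style-Adaptive Layer Normalization conditions each normalization on a style vector instead of concatenating a speaker embedding to the inputs. A minimal sketch (module and argument names are mine, not taken from the linked code):

```python
# Minimal sketch of Style-Adaptive LayerNorm (SALN): the scale and shift of an
# affine-free LayerNorm are predicted from a style vector, so every normalized
# layer is conditioned on the speaker/style. Names here are illustrative only.
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # One linear layer predicts both gain and bias from the style vector.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gamma, beta = self.affine(style).unsqueeze(1).chunk(2, dim=-1)
        return (1.0 + gamma) * self.norm(x) + beta

# Usage idea: swap LayerNorm layers inside the encoder/decoder blocks for this
# module and pass the output of a transformer style encoder as `style`.
x = torch.randn(2, 50, 192)
style = torch.randn(2, 128)
saln = StyleAdaptiveLayerNorm(192, 128)
out = saln(x, style)   # (2, 50, 192)
```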

@Hertin @ivanvovk Thanks for your feedback.

I'll look into this paper, since I too believe that your model is a solid basis for a multi-speaker model.

I wonder if I should add an additional loss term to the diffusion loss. I tried adding Adaptive Layer Normalization and an additional transformer style encoder, but the results are getting weird. I think maybe an additional loss term for the style is also required?

@iooops
I've just launched some experiments with the so-called Style-Adaptive LN and a stack of self-attention layers for style encoding. I'll tell you if it works well for me. The only difference is that I don't use MAS to upsample the phoneme codes, but rather use precomputed alignments.
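As an illustration of that last point, expanding phoneme-level encoder outputs to frame level with precomputed durations (e.g. from an external forced aligner) can be as simple as the following sketch; the function and variable names are mine:

```python
# Hedged sketch: expand phoneme-level encoder outputs to frame level using
# precomputed durations instead of the monotonic alignment search (MAS) used
# in the original Grad-TTS training. Names are illustrative only.
import torch

def expand_by_durations(enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """enc: (n_phonemes, channels), durations: (n_phonemes,) integer frame counts."""
    return torch.repeat_interleave(enc, durations, dim=0)

enc = torch.randn(5, 192)                       # 5 phonemes, 192 channels
durations = torch.tensor([3, 7, 2, 5, 4])       # frames per phoneme
frames = expand_by_durations(enc, durations)    # (21, 192) frame-level features
```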