huawei-noah / Speech-Backbones

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

Not able to generate audio of as good quality with LibriTTS as with LJSpeech

Hertin opened this issue · comments

Hi, thank you for the great work and for releasing the pretrained model. I tried to train the Grad-TTS model on LibriTTS (multi-speaker) and on LJSpeech (single-speaker) and found that the single-speaker setting gives much better quality than the multi-speaker one. This is even true when using your released grad-tts-libri-tts.pt. Were you able to get better quality in the multi-speaker setting? Here are a few samples I generated in the multi-speaker setting using your released model: https://drive.google.com/drive/folders/1ze0_rJXtmPY3JNAwnr0A_9C4OVvULEj7?usp=sharing.

@Hertin Hi! We provided the multi-speaker setting and checkpoint in this code for basic review only and did not aim for competitive results. To close the quality gap in the multi-speaker scenario, I would suggest increasing the number of channels in the model and using a pre-trained speaker encoder instead of learnable speaker embeddings. In our own experiments, that helped to get much better results.
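For what it's worth, here is a minimal sketch of the second suggestion (my own illustration, not the authors' code): extract fixed d-vectors with a pre-trained speaker encoder such as Resemblyzer and project them to the `spk_emb_dim` the model expects, instead of looking up a learnable `torch.nn.Embedding`. The projection layer and variable names are assumptions.

```python
# Hedged sketch: precompute fixed speaker embeddings (d-vectors) with a
# pre-trained encoder and feed them to Grad-TTS instead of a learnable
# nn.Embedding lookup. The Linear projection and names below are my own
# illustration, not part of the released code.
import torch
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer

voice_encoder = VoiceEncoder()  # outputs 256-dim utterance embeddings

def extract_dvector(wav_path: str) -> torch.Tensor:
    """Return a fixed 256-dim speaker embedding for one reference utterance."""
    wav = preprocess_wav(wav_path)
    return torch.from_numpy(voice_encoder.embed_utterance(wav))

# Project the frozen d-vector into the dimensionality the model expects,
# then pass it wherever the embedding lookup is currently consumed.
spk_emb_dim = 64                      # speaker embedding size from params.py
spk_proj = torch.nn.Linear(256, spk_emb_dim)

dvec = extract_dvector("reference.wav")   # (256,)
spk_emb = spk_proj(dvec.unsqueeze(0))     # (1, spk_emb_dim)
```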

Thank you for the suggestions. I will try to increase the number of channels. Are you able to share the configuration (e.g. n_enc_channels, filter_channels, etc.) you used to get better results? I understand if you can't or don't want to.

@Hertin try [256, 512, 1024] channels in the UNet. That is dim=256 (dec_dim in params.py) and dim_mults=[1, 2, 4] in the UNet constructor.
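For reference, this is roughly what the change looks like, assuming the stock Grad-TTS layout where params.py defines dec_dim and the UNet score estimator takes dim and dim_mults (a sketch, not a verified diff):

```python
# Sketch of the suggested change under the stock Grad-TTS layout.

# params.py
dec_dim = 256            # base UNet width; the released config uses 64

# UNet construction (schematically): with dim_mults = [1, 2, 4] the three
# resolution stages then get 256, 512 and 1024 channels respectively.
dim_mults = [1, 2, 4]
channels = [dec_dim * m for m in dim_mults]   # -> [256, 512, 1024]
```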

Thank you very much. I will try this setting.

Hi, thanks for the great work and the insights you gave.
I'm also interested in setting up your model for the multi-speaker task.
Following your advice to increase the number of channels, the number of parameters obviously rocketed, as did the memory usage.
In fact I can barely go above dec_dim=72 before it saturates GPU memory, with an extra 4 GB used compared to dec_dim=64.
Have you been able to train it on a larger GPU? Does it scale well? Any other advice for the multi-speaker setting?

No. I gave up. Although I was able to train the model, the generated audio does not sound as good as in the single-speaker setting.

@theodorblackbird @Hertin check out a smarter way to design multi-speaker Grad-TTS here. They use Adaptive Layer Normalization and an additional transformer style encoder after alignment. It should be feasible in terms of memory usage while keeping the diffusion part itself small enough. Its performance is at SOTA level as far as I know.
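For context, Style-Adaptive Layer Normalization conditions each normalization on a style vector instead of concatenating a speaker embedding to the inputs. A minimal sketch (module and argument names are mine, not taken from the linked code):

```python
# Minimal sketch of Style-Adaptive LayerNorm (SALN): the scale and shift of an
# affine-free LayerNorm are predicted from a style vector, so every normalized
# layer is conditioned on the speaker/style. Names here are illustrative only.
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # One linear layer predicts both gain and bias from the style vector.
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), style: (batch, style_dim)
        gamma, beta = self.affine(style).unsqueeze(1).chunk(2, dim=-1)
        return (1.0 + gamma) * self.norm(x) + beta

# Usage idea: swap LayerNorm layers inside the encoder/decoder blocks for this
# module and pass the output of a transformer style encoder as `style`.
x = torch.randn(2, 50, 192)
style = torch.randn(2, 128)
saln = StyleAdaptiveLayerNorm(192, 128)
out = saln(x, style)   # (2, 50, 192)
```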

@Hertin @ivanvovk Thanks for your feedback.

I'll look into this paper, since I too believe that your model is a solid basis for a multi-speaker model.

I wonder if I should add an additional loss term to the diffusion loss. I tried adding Adaptive Layer Normalization and an additional transformer style encoder, but the results are getting weird. I think maybe an additional loss term for the style is also required?

@iooops
I've just launched some experiments with the so-called Style-Adaptive LN and a stack of self-attention layers for style encoding. I'll tell you if it works well for me. The only difference is that I don't use MAS to upsample the phoneme codes, but rather use precomputed alignments.
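As an illustration of that last point, expanding phoneme-level encoder outputs to frame level with precomputed durations (e.g. from an external forced aligner) can be as simple as the following sketch; the function and variable names are mine:

```python
# Hedged sketch: expand phoneme-level encoder outputs to frame level using
# precomputed durations instead of the monotonic alignment search (MAS) used
# in the original Grad-TTS training. Names are illustrative only.
import torch

def expand_by_durations(enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """enc: (n_phonemes, channels), durations: (n_phonemes,) integer frame counts."""
    return torch.repeat_interleave(enc, durations, dim=0)

enc = torch.randn(5, 192)                       # 5 phonemes, 192 channels
durations = torch.tensor([3, 7, 2, 5, 4])       # frames per phoneme
frames = expand_by_durations(enc, durations)    # (21, 192) frame-level features
```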