adelacvg / ttts

Train the next generation of TTS systems.

Use this model for Voice conversion

rishikksh20 opened this issue

Hi @adelacvg

Can we use this kind of model for speech-to-speech (voice conversion)?

Yes, and that's exactly what I'm working on. You can have a look here for a rough idea of the approach. The main idea is to use ReferenceNet to enhance the zero-shot capability.
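
For readers new to the idea, below is a minimal, hypothetical sketch of ReferenceNet-style conditioning. The module and parameter names are illustrative only and are not taken from this repo: a reference encoder turns a reference mel-spectrogram into frame-level features, and the main network attends to them so the output follows the reference speaker's timbre.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative reference encoder: mel-spectrogram -> frame-level features."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, ref_mel):                  # (B, n_mels, T_ref)
        x = self.proj(ref_mel).transpose(1, 2)   # (B, T_ref, dim)
        return self.blocks(x)

class CrossAttentionInjection(nn.Module):
    """Inject reference features into the main network via cross-attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, h, ref_feats):             # h: (B, T, dim)
        out, _ = self.attn(query=h, key=ref_feats, value=ref_feats)
        return h + out                           # residual timbre injection
```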

I checked the v4 branch and it looks good to me. Have you trained the model? If yes, how's the quality?

I also checked your v3 branch and the samples sound good. Have you trained that model on any English dataset?

I would like to train v3 and v4 on a large English dataset.
Would you guide me a little? Can HuBERT or XLS-R be used to extract the semantic vectors, or is ContentVec required?

I do not recommend training with v3 because it still uses inefficient modules like FiLM for timbre addition. As for training with v4, all I can say is that the training is very, very slow, but it's worth it. Using a small batch size and a relatively large learning rate may be a cost-effective approach. The longer the training time, the better the results. Although contentvec may not be perfect, I think it's sufficient. Other semantic features might lead to timbre leakage, although I haven't conducted extensive experiments to validate this.
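
As a rough illustration of what extracting these semantic features looks like in practice, here is a hedged sketch using a plain HuBERT checkpoint from Hugging Face; a ContentVec checkpoint would be used the same way but is distributed separately, so the model name and input file below are assumptions, not this repo's actual pipeline.

```python
import torch
import torchaudio
from transformers import HubertModel

# Assumed checkpoint: plain HuBERT base. A ContentVec checkpoint would be loaded
# analogously and may leak less timbre, as discussed above.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("speaker.wav")                          # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

with torch.no_grad():
    # last_hidden_state: (1, T_frames, 768) frame-level semantic features
    semantic = model(wav.unsqueeze(0)).last_hidden_state
print(semantic.shape)
```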

I trained using the same dataset as v2, which is a mixed dataset containing both Chinese and English.

OK, then I will try to train v4 only. But is that repo completely implemented, or does something remain?
If it's complete, have you run any kind of training on it?

The current code is trainable, and I have obtained some promising results. It's worth noting that convergence is slow: with a batch size of 32, it takes about 500k steps to yield satisfactory results. I have implemented the code for modules like CFG and offset noise, but for the sake of training stability I haven't enabled them for now. These functionalities can be added through fine-tuning after convergence.
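
For reference, the two tricks mentioned here (classifier-free guidance and offset noise) are usually folded into a diffusion training loop roughly as sketched below. This is a generic illustration under assumed tensor shapes (B, C, T), not the repo's actual implementation.

```python
import torch

def add_offset_noise(x0, offset_strength=0.1):
    """Offset noise: add a per-(batch, channel) constant to the Gaussian noise
    so the model can shift the overall signal level more easily."""
    noise = torch.randn_like(x0)
    noise += offset_strength * torch.randn(x0.shape[0], x0.shape[1], 1, device=x0.device)
    return noise

def drop_condition_for_cfg(cond, drop_prob=0.1):
    """CFG training: randomly zero the conditioning so the same network also
    learns the unconditional distribution and can be guided at inference."""
    keep = (torch.rand(cond.shape[0], device=cond.device) > drop_prob).float()
    return cond * keep.view(-1, *([1] * (cond.dim() - 1)))
```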

What was the dataset size, and on how many GPUs did you train the model?
Actually, I am planning to train this model on Multilingual LibriSpeech, which has 50k hours of data. But before that, I will do a demo training run on a smaller dataset of around 3k to 5k hours to check the parameters and training stability.

I only used 300 hours of data, and the training was done on just two GeForce RTX 3090 GPUs.

Hi @adelacvg, is the implementation of this end-to-end TTS repo complete? I have tested NS2VC v4 on 500 hours of Hindi data with Whisper features and it's working great. I have a few findings on that repo which I will share in its issues.