adelacvg / ttts

Train the next generation of TTS systems.

Use this model for Voice conversion

rishikksh20 opened this issue

Hi @adelacvg

Can we use this kind of model for speech-to-speech (voice conversion)?

Yes, and that's exactly what I'm working on. You can have a look here for a rough idea of the approach. The main idea is to use ReferenceNet to enhance the zero-shot capability.
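
For readers new to the idea, below is a minimal, hypothetical sketch of ReferenceNet-style conditioning. The module and parameter names are illustrative only and are not taken from this repo: a reference encoder turns a reference mel-spectrogram into frame-level features, and the main network attends to them so the output follows the reference speaker's timbre.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Illustrative reference encoder: mel-spectrogram -> frame-level features."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Conv1d(n_mels, dim, kernel_size=3, padding=1)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, ref_mel):                  # (B, n_mels, T_ref)
        x = self.proj(ref_mel).transpose(1, 2)   # (B, T_ref, dim)
        return self.blocks(x)

class CrossAttentionInjection(nn.Module):
    """Inject reference features into the main network via cross-attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, h, ref_feats):             # h: (B, T, dim)
        out, _ = self.attn(query=h, key=ref_feats, value=ref_feats)
        return h + out                           # residual timbre injection
```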

I checked the v4 branch and it looks good to me. Have you trained the model? If yes, how's the quality?

I also checked your v3 branch and the samples sound good. Have you trained that model on any English dataset?

I would like to train v3 and v4 on a large English dataset.
Would you guide me a little? Can HuBERT or XLS-R be used to extract the semantic vectors, or is ContentVec required?

I do not recommend training with v3 because it still uses inefficient modules like FiLM for timbre addition. As for training with v4, all I can say is that the training is very, very slow, but it's worth it. Using a small batch size and a relatively large learning rate may be a cost-effective approach. The longer the training time, the better the results. Although contentvec may not be perfect, I think it's sufficient. Other semantic features might lead to timbre leakage, although I haven't conducted extensive experiments to validate this.
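
As a rough illustration of what extracting these semantic features looks like in practice, here is a hedged sketch using a plain HuBERT checkpoint from Hugging Face; a ContentVec checkpoint would be used the same way but is distributed separately, so the model name and input file below are assumptions, not this repo's actual pipeline.

```python
import torch
import torchaudio
from transformers import HubertModel

# Assumed checkpoint: plain HuBERT base. A ContentVec checkpoint would be loaded
# analogously and may leak less timbre, as discussed above.
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("speaker.wav")                          # hypothetical input file
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)  # mono, 16 kHz

with torch.no_grad():
    # last_hidden_state: (1, T_frames, 768) frame-level semantic features
    semantic = model(wav.unsqueeze(0)).last_hidden_state
print(semantic.shape)
```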

I trained using the same dataset as v2, which is a mixed dataset containing both Chinese and English.

OK, then I will try to train v4 only. But is that repo completely implemented, or does something remain?
If it's complete, have you run any kind of training on it?

The current code is trainable, and I have obtained some promising results. It's worth noting that convergence is slow: with a batch size of 32, it takes about 500k steps to yield satisfactory results. I have implemented the code for modules like CFG and offset noise, but for the sake of training stability I haven't enabled them for now. These functionalities can be added through fine-tuning after convergence.
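
For reference, the two tricks mentioned here (classifier-free guidance and offset noise) are usually folded into a diffusion training loop roughly as sketched below. This is a generic illustration under assumed tensor shapes (B, C, T), not the repo's actual implementation.

```python
import torch

def add_offset_noise(x0, offset_strength=0.1):
    """Offset noise: add a per-(batch, channel) constant to the Gaussian noise
    so the model can shift the overall signal level more easily."""
    noise = torch.randn_like(x0)
    noise += offset_strength * torch.randn(x0.shape[0], x0.shape[1], 1, device=x0.device)
    return noise

def drop_condition_for_cfg(cond, drop_prob=0.1):
    """CFG training: randomly zero the conditioning so the same network also
    learns the unconditional distribution and can be guided at inference."""
    keep = (torch.rand(cond.shape[0], device=cond.device) > drop_prob).float()
    return cond * keep.view(-1, *([1] * (cond.dim() - 1)))
```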

What was the dataset size, and on how many GPUs did you train the model?
Actually, I am planning to train this model on Multilingual LibriSpeech, which has 50k hours of data. But before that, I will do a demo training run on a smaller dataset of around 3k to 5k hours to check the parameters and training stability.

I only used 300 hours of data, and the training was done on just two GeForce RTX 3090 GPUs.

Hi @adelacvg, is the implementation of this end-to-end TTS repo complete? I have tested NS2VC v4 on 500 hours of Hindi data with Whisper features and it's working great. I have a few findings on that repo which I will share in its issues.