ex3ndr / supervoice-vall-e-2

VALL-E 2 reproduction

Can you provide a readme?

qiaoyuqy opened this issue · comments

Hello, could you provide a README describing the configuration process and giving an introduction? Thank you very much.

Hey, I am still in the process of training!

Been following this process closely, so excited to see what comes out of it!

@qiaoyuqy I got a head start setting up my env for training by converting the LibriLight dataset with https://github.com/ex3ndr/supervoice-librilight-encodec, and the train scripts seem to work just fine with it.

Actually, I have almost finished training the NAR model; it works really well for in-domain samples. You can also download pre-converted datasets using my dataset tool.

Awesome, thanks for the info! Is there any major difference between training directly on a dataset converted with your encodec repo versus the aligned version from your preprocessed-dataset repo?

I noticed you mentioned that Whisper and MFA were used; is there a reason for this, given that LibriLight has its own transcripts?

My librilight-preprocessed dataset was my naive attempt to transcribe it, but it failed: it had too many errors, and networks trained on it produced too many mistakes like misspellings and slightly wrong words. The two datasets are essentially the same audio-wise, though. Libriheavy has much, much better transcriptions.

Gotcha! So, just a clarifying question: I have the jsonl.gz files from Libriheavy and the audio files from LibriLight. I then used your librilight-encodec repository, ran encode.py, and have the converted output.

Is this fine to train on, or would you recommend running on the preprocessed datasets downloadable from your server?

I'm currently in the middle of downloading one of these preprocessed datasets, but I haven't been able to look through it for differences yet, so I thought I'd just ask for clarification here.
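For anyone comparing the two, the Libriheavy manifests mentioned above are gzipped JSON Lines files, one record per line. A minimal reader sketch, assuming only that layout (the field names inside each record vary per manifest and are not checked here):

```python
import gzip
import json

def read_manifest(path):
    """Yield one dict per line from a .jsonl.gz manifest file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Iterating lazily like this avoids holding the whole manifest in memory, which matters for the larger LibriLight splits.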

They should be exactly the same, all my work is reproducible! So it is up to you.

Awesome! Well, I appreciate the work. I'm about 130k steps into training; we'll see how this goes!

I have finished the training and published the results. The network follows the speaker much better than Voicebox, but it is still not as good as it should be for out-of-domain speakers.

@ex3ndr Can we use voice-cloning training now?

This is a zero-shot voice-cloning network, so there is nothing to train here: just provide a clean 3-5 second sample along with its text.
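A quick sanity check for that prompt length, using only the standard library and assuming the prompt is a PCM WAV file (the 3-5 second window follows the suggestion above; it is a guideline, not an API of this repo):

```python
import wave

def prompt_duration_ok(path, lo=3.0, hi=5.0):
    """Return True if the WAV prompt falls in the suggested 3-5 s window."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return lo <= duration <= hi
```

Checking this up front is cheaper than discovering after inference that a too-short prompt gave the model too little speaker information.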

@ex3ndr Thanks! How can I add another language?