ex3ndr / supervoice-vall-e-2

VALL-E 2 reproduction

Can you provide a readme?

qiaoyuqy opened this issue · comments

Hello, could you provide a README describing the configuration process and giving an introduction? Thank you very much.

Hey, I am still in the process of training!

Been following this process closely, so excited to see what comes out of it!

@qiaoyuqy I got a head start setting up my env for training by converting the LibriLight dataset with https://github.com/ex3ndr/supervoice-librilight-encodec, and the train scripts seem to work just fine with it.

Actually, I have almost finished training the NAR model; it works really well for in-domain samples. You can also download pre-converted datasets using my dataset tool.

Awesome, thanks for the info! Is there any major difference between training directly on a dataset converted with your encodec repo versus the aligned version from your preprocessed-dataset repo?

I noticed you mentioned that Whisper and MFA were used; is there a reason for this, given that LibriLight has its own transcripts?

My librilight-preprocessed dataset was my naive attempt to transcribe it, but it failed: it had too many errors, and networks trained on it produced too many mistakes like misspellings and slightly wrong words. The two datasets are essentially the same audio-wise, though. Libriheavy has much, much better transcriptions.

Gotcha! So, just a clarifying question: I have the jsonl.gz files from Libriheavy and the audio files from LibriLight. I then used your librilight-encodec repository, ran encode.py, and have the converted output.

Is this fine to train on, or would you recommend running on the preprocessed datasets downloadable from your server?

I'm currently in the middle of downloading one of these preprocessed datasets, but I haven't been able to look through it for differences yet, so I thought I'd just ask for clarification here.
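For anyone comparing the two, the Libriheavy manifests mentioned above are gzipped JSON Lines files, one record per line. A minimal reader sketch, assuming only that layout (the field names inside each record vary per manifest and are not checked here):

```python
import gzip
import json

def read_manifest(path):
    """Yield one dict per line from a .jsonl.gz manifest file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Iterating lazily like this avoids holding the whole manifest in memory, which matters for the larger LibriLight splits.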

They should be exactly the same, all my work is reproducible! So it is up to you.

Awesome! Well, I appreciate the work. I'm about 130k steps into training; we'll see how this goes!

I have finished the training and published the results. The network follows the speaker much better than Voicebox, but it is still not as good as it should be for out-of-domain speakers.

@ex3ndr Can we use voice-cloning training now?

This is a zero-shot voice-cloning network, so there is nothing to train here: just provide a clean 3-5 second sample along with its text.
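A quick sanity check for that prompt length, using only the standard library and assuming the prompt is a PCM WAV file (the 3-5 second window follows the suggestion above; it is a guideline, not an API of this repo):

```python
import wave

def prompt_duration_ok(path, lo=3.0, hi=5.0):
    """Return True if the WAV prompt falls in the suggested 3-5 s window."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return lo <= duration <= hi
```

Checking this up front is cheaper than discovering after inference that a too-short prompt gave the model too little speaker information.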

@ex3ndr Thanks! How can I add another language?