Quality of exact style cloning

Question

raymond00000 opened this issue a year ago

Hi guys,

I am very impressed with the paper's idea and results.
Thanks a lot for sharing the code, too.

I am trying the example, using Obama's voice to speak the demo page's sample: "Wow, what a beautiful day!"
However, the synthesized speech does not sound to me like it was spoken by Obama.
I would like to ask whether I did something wrong, or how to get a better cloned voice.

The reference audio and the cloned audio are in the attached zip:
audios.zip

This is the code I tried:

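import torch
from InferenceInterfaces.UtteranceCloner import UtteranceCloner  # import path assumed; adjust to your checkout

device = "cuda" if torch.cuda.is_available() else "cpu"
uc = UtteranceCloner(model_id="Meta", device=device)
# Voice: the speaker embedding is taken from the Obama recording
uc.tts.set_utterance_embedding("audios/20090307_Weekly_Address.0000.0001.wav")
# Speaking style: the prosody is cloned from the reference recording
uc.clone_utterance(path_to_reference_audio="audios/human.wav",
                   reference_transcription="Wow, what a beautiful day!",
                   filename_of_result="audios/obama_test_cloned_wow.wav",
                   clone_speaker_identity=False,  # keep the embedding set above
                   lang="en")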

Thanks!

Florian Lux · Answer 1 · Mon Jul 17 2023 23:17:32 GMT+0800 (China Standard Time)

It looks like you're doing it correctly. There are just two issues.

First, to make the result sound like Obama, the prosody reference would already need to have a speaking style similar to his, because the speaking style is taken entirely from the prosody reference audio. For recognizing the speaker, this has an even bigger impact than the voice itself.

Second, this toolkit is currently very bad at cloning voices unseen during training. It is possible, but it doesn't work well. I've been working on improving this for the last few months, but it's very challenging, so it will take more time.
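
To make the split between voice and speaking style concrete, here is a minimal sketch that wraps the two calls from the question in one helper. The helper name `clone_with_style_reference` and all file paths are hypothetical; `cloner` is assumed to be a `UtteranceCloner` built as in the question, and only the calls and keyword arguments already shown there are used. It does not change what the toolkit does, it just makes explicit which file supplies the voice and which supplies the style.

```python
def clone_with_style_reference(cloner, speaker_wav, prosody_wav,
                               transcript, out_wav, lang="en"):
    """Hypothetical helper: voice from speaker_wav, speaking style from prosody_wav."""
    # The voice comes from the speaker embedding of this file.
    cloner.tts.set_utterance_embedding(speaker_wav)
    # The speaking style is taken entirely from prosody_wav, so this file
    # should already be spoken in a style close to the target speaker's.
    cloner.clone_utterance(path_to_reference_audio=prosody_wav,
                           reference_transcription=transcript,
                           filename_of_result=out_wav,
                           clone_speaker_identity=False,  # keep the embedding set above
                           lang=lang)
    return out_wav


# Example call (placeholder paths; uc is the UtteranceCloner from the question):
# clone_with_style_reference(uc,
#                            "audios/target_speaker.wav",
#                            "audios/reference_in_similar_style.wav",
#                            "Wow, what a beautiful day!",
#                            "audios/result.wav")
```

With this layout, the only thing to swap when the result does not sound like the target is `prosody_wav`: a reference recorded in a delivery closer to the target speaker's should help more than changing the embedding file. The second issue, voices unseen during training, is not something a different call can work around.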