DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.

Quality of exact style cloning

raymond00000 opened this issue

Hi guys,

I am very impressed with the paper's idea and results.
Also, thanks a lot for sharing the code.

I am trying the example, using Obama's voice to speak the demo page's sample: "Wow, what a beautiful day!"
However, the voice in the synthesized speech does not sound like Obama to me.
Could you advise whether I did something wrong, or how to get a better cloned audio?

I attached the reference audio and the cloned audio in this zip:
audios.zip

This is the code I tried:

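import torch
from InferenceInterfaces.UtteranceCloner import UtteranceCloner  # import path as in the repo's run_utterance_cloner.py; may differ by version

uc = UtteranceCloner(model_id="Meta", device="cuda" if torch.cuda.is_available() else "cpu")
# take the voice from the Obama recording
uc.tts.set_utterance_embedding("audios/20090307_Weekly_Address.0000.0001.wav")
# take the prosody from human.wav; keep the voice set above instead of the reference speaker's
uc.clone_utterance(path_to_reference_audio="audios/human.wav",
                   reference_transcription="Wow, what a beautiful day!",
                   filename_of_result="audios/obama_test_cloned_wow.wav",
                   clone_speaker_identity=False,
                   lang="en")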

Thanks!

It looks like you're doing it correctly; there are just two issues. First, to make it sound like Obama, the prosody reference would already need to have a speaking style similar to his, because the speaking style is taken entirely from the prosody reference audio. Speaking style has an even bigger impact than the voice on whether the speaker can be recognized. Second, this toolkit is currently very bad at cloning voices unseen during training. It is possible, but it doesn't work well. I've been working on improving this for the last few months, but it's very challenging, so it will take more time.
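
To make the two roles explicit, here is a minimal sketch that wraps the two calls from the snippet in the question. The helper name `clone_with_separate_voice` and its parameter names are hypothetical; it only uses `tts.set_utterance_embedding` and `clone_utterance` with the arguments that appear in that snippet, and it expects an already constructed `UtteranceCloner` to be passed in.

```python
def clone_with_separate_voice(cloner, voice_wav, prosody_wav, transcription,
                              result_wav, lang="en"):
    """Hypothetical helper: voice from one recording, speaking style from another.

    cloner        -- an already constructed UtteranceCloner (assumed, see the question)
    voice_wav     -- recording of the target speaker (e.g. the Obama clip)
    prosody_wav   -- recording whose speaking style is copied
    transcription -- text spoken in prosody_wav
    """
    # The voice comes from the utterance embedding of the target speaker.
    cloner.tts.set_utterance_embedding(voice_wav)
    # The speaking style comes from the prosody reference. With
    # clone_speaker_identity=False the embedding set above is kept.
    cloner.clone_utterance(path_to_reference_audio=prosody_wav,
                           reference_transcription=transcription,
                           filename_of_result=result_wav,
                           clone_speaker_identity=False,
                           lang=lang)
    return result_wav
```

Passing a prosody reference recorded in a speaking style close to the target speaker's should help with the first issue; it does not address the second one.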