jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Generating long speeches

RootingInLoad opened this issue · comments

Would there be a way to generate long speeches?

Right now, the model needs to be fed at least 3 seconds of reference speech each time you want to run inference on something new, and if the desired generation is too long, it hallucinates and ends up producing gibberish.

One way to work around this would be to generate the speech sentence by sentence. One problem with that is that it would still require those 3 seconds of base speech for every sentence. The other is the consistency of the final result, since the intonation could vary wildly from one sentence to the next.
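For context, here is a rough sketch of that fixed-prompt, sentence-by-sentence approach in Python. The `tts_infer(prompt_wav, prompt_transcript, target_text)` call is a hypothetical wrapper around the repo's TTS inference (the real entry point takes more arguments), so treat this as a sketch of the workflow, not something runnable against VoiceCraft as-is:

```python
import re

import torch


def split_sentences(text: str):
    # Naive sentence splitter; a real pipeline might use nltk or spacy instead.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def generate_long_speech(prompt_wav, prompt_transcript, long_text):
    """Generate every sentence with the SAME ~3 s ground-truth prompt and
    concatenate the results. The voice stays anchored to the prompt, but
    intonation can jump between sentences since each one is generated
    independently."""
    pieces = []
    for sentence in split_sentences(long_text):
        # tts_infer is a hypothetical wrapper around the repo's TTS inference.
        wav = tts_infer(prompt_wav, prompt_transcript, sentence)
        pieces.append(wav)
    return torch.cat(pieces, dim=-1)
```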

Does anyone have an idea?

If you were to do something like that, you should just use the generated speech to predict the next speech.

I am worried that using a generated chunk to generate new content will accumulate errors and lead to a similar issue.

I am using the demo voice and trying to generate text with 550 characters.

Edit: I just tried both approaches. The error accumulates, and the output is even worse when I use the last generated segment as the prompt for the next one. The first approach with a fixed segment works, but it does lose fluidity.

I don't have a good solution for that right now. It might require some model development.

One middle ground is to concatenate the original prompt and the previously generated sentence, and use that as the prompt to generate the next sentence. The original prompt is ground-truth audio, which prevents voice drifting, and the newly generated sentence keeps the intonation consistent. Disadvantages: 1) it might not work; 2) it leads to a long prompt and therefore limits the generation length.
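As a rough illustration of that middle ground (untested, and using the same hypothetical `tts_infer` wrapper as the sketch above), the prompt for each new sentence would be the ground-truth clip followed by the previously generated sentence:

```python
import torch


def generate_with_rolling_prompt(prompt_wav, prompt_transcript, sentences):
    """The prompt for each sentence is the original ground-truth clip
    (anchors the voice) plus the previously generated sentence (carries
    over the intonation). Note the prompt grows, so long sentences can
    still push the total length past what the model handles well."""
    pieces = []
    prev_wav, prev_text = None, None
    for sentence in sentences:
        if prev_wav is None:
            cur_wav, cur_text = prompt_wav, prompt_transcript
        else:
            cur_wav = torch.cat([prompt_wav, prev_wav], dim=-1)
            cur_text = prompt_transcript + " " + prev_text
        # tts_infer is the same hypothetical wrapper as in the earlier sketch.
        wav = tts_infer(cur_wav, cur_text, sentence)
        pieces.append(wav)
        prev_wav, prev_text = wav, sentence
    return torch.cat(pieces, dim=-1)
```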

> I am worried that using a generated chunk to generate new content will accumulate errors and lead to a similar issue.
>
> I am using the demo voice and trying to generate text with 550 characters.
>
> Edit: I just tried both approaches. The error accumulates, and the output seems even worse. The first one works, but it does lose fluidity.

I tried it with one extra generation and it seemed fine.

> If you were to do something like that, you should just use the generated speech to predict the next speech.

Yeah, that could be a good idea. But the issue, mentioned above, is that over time it will definitely lose consistency. Since I'd like to generate at least short articles with a custom voice, this won't really work.

What if we trained the model on a single voice? Could we then run inference without an audio sample, while still keeping consistency?

> I am worried that using a generated chunk to generate new content will accumulate errors and lead to a similar issue.
>
> I am using the demo voice and trying to generate text with 550 characters.
>
> Edit: I just tried both approaches. The error accumulates, and the output seems even worse. The first one works, but it does lose fluidity.

> I tried it with one extra generation and it seemed fine.

Did you do any extra tweaks to achieve that result?