FACEGOOD / FACEGOOD-Audio2Face

http://www.facegood.cc

How did you generate the training data (bs values)?

ChairManMeow-SY opened this issue · comments

From what I can find in your Google Drive, the training data is generated from Zishu Mei's audio recordings, which are produced with a TTS algorithm.

So how do you get the blendshape values for Zishu Mei? It confuses me because Zishu Mei is a 3D model.

If the blendshape values are captured from a human with a facial motion capture system, how do we guarantee that the TTS audio matches the blendshape values?

Please correct me if I have misunderstood something.

commented

That is a good question.
In the early stage of the project, we used TTS to create the audio data from text, and we manually created the animation of the Zishu Mei model to match the audio data: like listening to a pronunciation and making the corresponding animation.
For now, we suggest using a facial motion capture system, recording the actor's video and audio at the same time during the performance. The TTS audio is generated with the actor's customized voice.
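Because the audio and the blendshape animation are produced as two separate streams (TTS audio plus hand-made or mocap animation), they only stay in sync because each blendshape frame sits at a known time on the shared timeline. Below is a minimal sketch of how per-frame training pairs could be assembled from an audio clip and its matching blendshape curves; the function name, window size, and frame rate are illustrative assumptions, not the repository's actual preprocessing code.

```python
import numpy as np

def build_training_pairs(audio, sample_rate, bs_frames, fps=30, window_ms=520):
    """Pair each blendshape frame with the audio window centred on it.

    audio      : 1-D float array, mono waveform (e.g. 16 kHz)
    bs_frames  : (num_frames, num_blendshapes) array exported from the
                 animation / mocap at a fixed frame rate `fps`
    window_ms  : length of audio context given to the network per frame
                 (the value here is an assumption, not the repo's setting)
    """
    half_win = int(sample_rate * window_ms / 1000) // 2
    pairs = []
    for i, bs in enumerate(bs_frames):
        # audio sample index aligned with blendshape frame i on the timeline
        center = int(round(i / fps * sample_rate))
        start, end = center - half_win, center + half_win
        # zero-pad at the clip boundaries so every frame gets a full window
        window = np.zeros(2 * half_win, dtype=np.float32)
        src = audio[max(start, 0):min(end, len(audio))]
        window[max(-start, 0):max(-start, 0) + len(src)] = src
        pairs.append((window, bs.astype(np.float32)))
    return pairs
```

The same alignment works whether the blendshape curves come from hand-made animation keyed to TTS audio or from mocap recorded simultaneously with the actor's voice, as long as both streams start at the same timestamp.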


Amazing, great job! The training data is really expensive and valuable. Thank you so much for the reply and for the data.