How to improve the accuracy/naturalness of lip sync
mi2think opened this issue
I made a video comparison with Oculus Lipsync: Oculus vs Rhubarb. The result doesn't look pleasant. As a next step, I'll try interpolating between two visemes; maybe it will look more natural after that. But I doubt I can get results like Oculus's.
Taking one frame from the videos, I found the viseme weights below:
Oculus: Video Percent: 0.0040, Visemes: [0.0218, 0.0004, 0.0001, 0.0005, 0.0009, 0.0004, 0.0001, 0.0001, 0.0334, 0.0001, 0.5889, 0.2065, 0.0023, 0.0065, 0.1380]
Rhubarb: Video Percent: 0.0040, Visemes: E
Oculus uses 15 visemes (see their Viseme Reference). I really don't know how they calculate weights across that many visemes; I only know they use a deep neural network.
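To make the structural difference concrete, here is a minimal sketch of my own (not code from either SDK): Oculus reports a continuous weight for every viseme each frame, while Rhubarb picks one discrete mouth shape, which is effectively a one-hot vector over the viseme set. The viseme names are from the Oculus reference; mapping Rhubarb's "E" onto the Oculus "E" is an assumption for illustration.

```python
# Sketch only: Oculus emits a continuous weight per viseme each frame,
# while Rhubarb emits a single discrete mouth shape -- effectively a
# one-hot weight vector over whatever viseme set the rig uses.

# The 15 visemes from the Oculus viseme reference.
OCULUS_VISEMES = ["sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
                  "nn", "RR", "aa", "E", "ih", "oh", "ou"]

def one_hot_weights(viseme, visemes=OCULUS_VISEMES):
    """Express one discrete viseme as a weight vector over the full set."""
    return [1.0 if v == viseme else 0.0 for v in visemes]

# Rhubarb's "E" at this frame, expressed in the same space as the Oculus
# weights (assuming Rhubarb's "E" maps to the Oculus "E" viseme):
print(one_hot_weights("E"))
```

Seen this way, Rhubarb's output jumps between corners of the weight space, which would explain why it looks so much stiffer than the smooth Oculus curves.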
So, is there any plan or suggestion for Rhubarb Lip Sync?
Update: I added interpolation between two neighboring visemes: Oculus vs Rhubarb with interpolation
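For reference, this is roughly what I mean by interpolation (a minimal sketch of my own; the keyframe timings and toy 3-viseme vectors are made up): linearly crossfade between the weight vectors of the two keyframes surrounding the current time, instead of switching hard at each keyframe boundary.

```python
def lerp(a, b, t):
    """Component-wise linear interpolation between two weight vectors, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def weights_at(time, keyframes):
    """Blend between the two keyframes surrounding `time`.
    `keyframes` is a list of (timestamp, weight_vector) pairs,
    sorted by strictly increasing timestamp."""
    if time <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, w0), (t1, w1) in zip(keyframes, keyframes[1:]):
        if t0 <= time <= t1:
            return lerp(w0, w1, (time - t0) / (t1 - t0))
    return keyframes[-1][1]  # past the last keyframe: hold it

# Hypothetical example: fade from viseme "E" to viseme "aa" over 0.1 s
# (toy 3-viseme weight vectors just for illustration).
keys = [
    (0.00, [0.0, 1.0, 0.0]),  # "E"
    (0.10, [1.0, 0.0, 0.0]),  # "aa"
]
print(weights_at(0.05, keys))  # -> [0.5, 0.5, 0.0], halfway through the fade
```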
There are multiple reasons why the Rhubarb output looks worse than the Oculus output.
- The original recording has a lot of reverb. Rhubarb was developed for dry, high-quality recordings and doesn't deal well with reverb.
- Rhubarb uses a much simpler architecture than Oculus, with no neural networks.
- Rhubarb was never meant for 3D animation. In its current state, Rhubarb is optimized for cartoon-style 2D animation.
The best thing you can do to improve results is to use a dry recording. But even that won't give you the kind of 3D animation you got with Oculus.