How to improve the accuracy/naturalness of lip sync
mi2think opened this issue
I made a video comparison with Oculus Lipsync: Oculus vs Rhubarb. The result doesn't look pleasant. As a next step, I'll try interpolating between two visemes; maybe it will look more natural after that. But I doubt I can get results like Oculus's.
Taking one frame from the videos, I found the viseme weights below:
Oculus: Video Percent: 0.0040, Visemes: [0.0218, 0.0004, 0.0001, 0.0005, 0.0009, 0.0004, 0.0001, 0.0001, 0.0334, 0.0001, 0.5889, 0.2065, 0.0023, 0.0065, 0.1380]
Rhubarb: Video Percent: 0.0040, Visemes: E
Oculus uses 15 visemes (see their Viseme Reference). I really don't know how they calculate weights across that many visemes; I only know they use a deep neural network.
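To make the structural difference concrete, here is a minimal sketch of my own (not code from either SDK): Oculus reports a continuous weight for every viseme each frame, while Rhubarb picks one discrete mouth shape, which is effectively a one-hot vector over the viseme set. The viseme names are from the Oculus reference; mapping Rhubarb's "E" onto the Oculus "E" is an assumption for illustration.

```python
# Sketch only: Oculus emits a continuous weight per viseme each frame,
# while Rhubarb emits a single discrete mouth shape -- effectively a
# one-hot weight vector over whatever viseme set the rig uses.

# The 15 visemes from the Oculus viseme reference.
OCULUS_VISEMES = ["sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
                  "nn", "RR", "aa", "E", "ih", "oh", "ou"]

def one_hot_weights(viseme, visemes=OCULUS_VISEMES):
    """Express one discrete viseme as a weight vector over the full set."""
    return [1.0 if v == viseme else 0.0 for v in visemes]

# Rhubarb's "E" at this frame, expressed in the same space as the Oculus
# weights (assuming Rhubarb's "E" maps to the Oculus "E" viseme):
print(one_hot_weights("E"))
```

Seen this way, Rhubarb's output jumps between corners of the weight space, which would explain why it looks so much stiffer than the smooth Oculus curves.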
So, is there any plan or suggestion for Rhubarb Lip Sync?
Update: I added interpolation between two neighboring visemes: Oculus vs Rhubarb with interpolation
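For reference, this is roughly what I mean by interpolation (a minimal sketch of my own; the keyframe timings and toy 3-viseme vectors are made up): linearly crossfade between the weight vectors of the two keyframes surrounding the current time, instead of switching hard at each keyframe boundary.

```python
def lerp(a, b, t):
    """Component-wise linear interpolation between two weight vectors, t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def weights_at(time, keyframes):
    """Blend between the two keyframes surrounding `time`.
    `keyframes` is a list of (timestamp, weight_vector) pairs,
    sorted by strictly increasing timestamp."""
    if time <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, w0), (t1, w1) in zip(keyframes, keyframes[1:]):
        if t0 <= time <= t1:
            return lerp(w0, w1, (time - t0) / (t1 - t0))
    return keyframes[-1][1]  # past the last keyframe: hold it

# Hypothetical example: fade from viseme "E" to viseme "aa" over 0.1 s
# (toy 3-viseme weight vectors just for illustration).
keys = [
    (0.00, [0.0, 1.0, 0.0]),  # "E"
    (0.10, [1.0, 0.0, 0.0]),  # "aa"
]
print(weights_at(0.05, keys))  # -> [0.5, 0.5, 0.0], halfway through the fade
```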
There are multiple reasons why the Rhubarb output looks worse than the Oculus output.
- The original recording has a lot of reverb. Rhubarb was developed for dry, high-quality recordings and doesn't deal well with reverb.
- Rhubarb uses a much simpler architecture than Oculus, with no neural networks.
- Rhubarb was never meant for 3D animation. In its current state, Rhubarb is optimized for cartoon-style 2D animation.
The best thing you can do to improve results is to use a dry recording. But even that won't give you the kind of 3D animation you got with Oculus.