tencent-ailab / V-Express

V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images.

Some problems and conclusions - Eye movements / inconsistencies / quality

A-2-H opened this issue · comments

commented

First of all, great work!

Here is what I observed while testing the program:

  • facial animation and likeness are the best among the repos I have tried (except EMO)

  • consistency works better with no-retarget for some videos and with naive-retarget for others. I can't figure out the cause, even when the pose of the input image and the input video are very similar.

  • eyeballs don't move. The eyes only blink, and that gives away the realism of the generated videos. The eyes never turn left/right/up/down; they just look straight ahead.

  • no emotions on the face. It looks a little stiff. Mouth movement looks okay, but when the input video shows some expression, the generated video doesn't reproduce it. This is especially noticeable around the eyes/eyebrows: no real smile or frown, just the expression already present in the input image.

  • there is something strange about consistency. If the video is quite short, inconsistencies and jitter are less likely to happen; the longer the video gets, the more often jitter and bugs appear. It looks like the fewer frames it has, the less confused it gets.
    This makes it hard to test: when I set everything up with a short clip and the result looks quite good, the same settings on a longer video don't look as good as they did in that shorter clip.

  • the generated results look a bit compressed. Is this how ffmpeg encodes it? At 512x512 it seems the quality could be better/less compressed, so what is the cause? (See the encoding note after this list.)

  • above 640x640 px there is an error about no faces in frame/too many faces. Below that resolution everything works.

  • usually when jitter happens it lasts 2-3 frames and then it goes back on track. Is there a way to avoid this? It looks like it loses track of body consistency for 2-3 seconds and then returns to what it should be.

  • the higher the audio attention weight, the more contrasted/distorted the results are

  • sometimes the background moves with head motions

  • would using a different base model than the original SD 1.5 change anything?

  • render time is very long and demanding, even on an RTX 4090.

  • it would be cool if there were some kind of debug script so we could preview the facial landmarks on the input image and the input video, since that can help avoid wasting time on a render (see the sketch after this list).

  • DreamTalk is a project contributed by Alibaba's team; maybe it can provide some insight about consistency/facial animation?
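On the debug-script and resolution points above: nothing like this exists in the repo as far as I know, so this is only a rough sketch of what such a preview could look like. It uses InsightFace's FaceAnalysis detector (V-Kps are face keypoints and InsightFace is one common way to get them, but the `buffalo_l` model name, the 640x640 detection size, and the file paths here are my assumptions, not values taken from V-Express). It draws the detected 5-point keypoints and face box so you can see before a long render whether the face is found at all, and whether larger inputs start failing detection.

```python
# debug_kps_preview.py -- hypothetical helper, not part of V-Express.
# Draws detected face keypoints and boxes so detection problems
# (no face, several faces, face too small) show up before a long render.
import cv2
from insightface.app import FaceAnalysis

def draw_faces(img, app):
    faces = app.get(img)
    print(f"detected {len(faces)} face(s), image size {img.shape[1]}x{img.shape[0]}")
    for face in faces:
        for (x, y) in face.kps:  # 5 keypoints: eyes, nose tip, mouth corners
            cv2.circle(img, (int(x), int(y)), 3, (0, 255, 0), -1)
        x1, y1, x2, y2 = map(int, face.bbox)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 0, 255), 2)
    return img

if __name__ == "__main__":
    # det_size=(640, 640) is an assumption; a detector prepared at this size
    # could be one reason inputs above 640x640 start failing detection.
    app = FaceAnalysis(name="buffalo_l")
    app.prepare(ctx_id=0, det_size=(640, 640))

    ref = cv2.imread("reference.png")  # hypothetical paths
    cv2.imwrite("debug_reference.png", draw_faces(ref, app))

    cap = cv2.VideoCapture("driving_video.mp4")
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("debug_first_frame.png", draw_faces(frame, app))
    cap.release()
```

Running this on the reference image and the first driving frame takes a few seconds and would surface the "no faces / too many faces" error before committing to a full render.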
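On the compression question above: I don't know exactly how the repo writes its mp4, but if the final video goes through ffmpeg with libx264, the default CRF of 23 is fairly lossy, and a lower CRF at encode time keeps noticeably more detail. A minimal sketch of encoding saved frames at a lower CRF (the frame directory, fps, and audio path are hypothetical, not V-Express defaults):

```python
# Hypothetical post-step: encode generated frames with a lower CRF.
# Frame directory, fps, and audio path are assumptions, not V-Express defaults.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "30",
    "-i", "out_frames/%05d.png",
    "-i", "input_audio.wav",
    "-c:v", "libx264",
    "-crf", "15",            # lower CRF = less compression (libx264 default is 23)
    "-pix_fmt", "yuv420p",
    "-c:a", "aac",
    "-shortest",
    "result_less_compressed.mp4",
], check=True)
```

Note that re-encoding an already compressed mp4 won't recover detail, so this only helps if the raw frames (or the save step itself) are still available.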

I tried this tutorial https://www.youtube.com/watch?v=ttEOIg9j2B4&t=327s and everything works fine, but the quality of the output is really bad. SadTalker is way better than V-Express. I hope Tencent can create something better than this.

I'm here because I am looking for an alternative to SadTalker, since its devs might have abandoned it. It's just sad that this newer AI by Tencent has output quality issues.

with "eyeballs doesn't move" - i think this logic from VASA - solves this problem
https://github.com/johndpope/VASA-1-hack/blob/main/FaceHelper.py#L155
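I haven't read the linked FaceHelper code, so I'm not claiming this is what it does; below is only a minimal sketch of the general idea of reading eyeball position from dense landmarks, here using MediaPipe's iris-refined face mesh. The landmark indices (468/473 for the two iris centers, 33/133 and 362/263 for the matching eye corners) follow MediaPipe Face Mesh conventions as I understand them, and the ratio is just a rough horizontal gaze measure, not a ready-made control signal for V-Express:

```python
# Rough sketch: estimate horizontal gaze from MediaPipe iris landmarks.
# Not the linked VASA FaceHelper code -- just the general idea of turning
# dense landmarks into an eyeball-position signal.
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

def horizontal_gaze(image_bgr):
    """Return a per-eye ratio pair; ~0.5 means the iris sits midway between the corners."""
    with mp_face_mesh.FaceMesh(static_image_mode=True,
                               refine_landmarks=True,
                               max_num_faces=1) as mesh:
        res = mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark

    def ratio(iris_idx, corner_a, corner_b):
        iris, a, b = lm[iris_idx], lm[corner_a], lm[corner_b]
        return (iris.x - a.x) / (b.x - a.x + 1e-6)

    # Index pairing per MediaPipe Face Mesh with iris refinement (assumption):
    # one eye uses iris 468 with corners 33/133, the other iris 473 with 362/263.
    return ratio(468, 33, 133), ratio(473, 362, 263)

if __name__ == "__main__":
    frame = cv2.imread("driving_frame.png")  # hypothetical path
    print(horizontal_gaze(frame))
```

Something like this would at least let you check whether the driving video contains gaze motion that the generated result is ignoring.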

@oisilener1982
I guess there may be some errors in your settings. If you provide a front-facing photo and the face ratio meets the requirements, there should not be low-quality results. In addition, if a frontal video is provided as the V-Kps reference sequence, the results will be more stable.

test3.41.mp4