johnGettings / LIHQ

Long-Inference, High Quality Synthetic Speaker (AI avatar/ AI presenter)


Replicate.com

galipmedia opened this issue · comments

It would be awesome to see this on replicate.com. The only one on there at present is the 256x256 MakeItTalk model, and it's not great.

The problem with LIHQ is its very long inference time (hence the name, "Long Inference, High Quality"). Although it's probably the best-looking open source option as far as I'm aware, it's probably not feasible to spend this much compute and this much time on these types of videos.

That's why I decided to open source it for hobbyists via Google Colab instead of trying to monetize it in any way.

Fair enough. I would be curious to know your thoughts on how d-id.com does it so fast; it's nearly instant.

I wouldn't be surprised if it were a method very similar to mine: a first order motion model (FOMM) for head, eye, and face motion, plus a Wav2Lip model to superimpose the mouth over the FOMM output. If you look closely, the mouth in their videos is distinctly worse quality than the rest of the face, which suggests they use two different models to generate it.
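For anyone curious what that superimposition step looks like, here is a minimal sketch of pasting a Wav2Lip mouth region back onto a FOMM-animated frame with a feathered blend. This is illustrative only, not code from LIHQ or d-id; in a real pipeline `mouth_box` would come from a face detector rather than being passed in by hand:

```python
import cv2
import numpy as np

def superimpose_mouth(fomm_frame: np.ndarray,
                      wav2lip_frame: np.ndarray,
                      mouth_box: tuple[int, int, int, int]) -> np.ndarray:
    """Paste the Wav2Lip mouth region onto the FOMM-animated frame.

    mouth_box is (x, y, w, h) in frame coordinates. Both inputs are
    BGR frames of the same size for the same timestep.
    """
    x, y, w, h = mouth_box
    out = fomm_frame.copy()
    patch = wav2lip_frame[y:y + h, x:x + w].astype(np.float32)
    base = out[y:y + h, x:x + w].astype(np.float32)

    # Feather an elliptical mask so the pasted mouth has no hard seam.
    mask = np.zeros((h, w), np.float32)
    cv2.ellipse(mask, (w // 2, h // 2), (w // 2, h // 2), 0, 0, 360, 1.0, -1)
    mask = cv2.GaussianBlur(mask, (31, 31), 0)[..., None]

    out[y:y + h, x:x + w] = (mask * patch + (1 - mask) * base).astype(np.uint8)
    return out
```

A visible quality mismatch between the blended region and the surrounding face is exactly the artifact described above: the mouth pixels come from a different generator than the rest of the frame.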

I'm thinking the main difference between my process and theirs is that I use the default 256x256 FOMM and Wav2Lip models, which means I need to run a two-stage AI-powered restoration and upscale process on every frame of the video, along with optional frame interpolation. That takes a very long time. They probably trained their own high-resolution models to skip all that, and use some powerful GPUs to speed it up even more.
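Structurally, that per-frame pass might look like the sketch below. The two stages are shown as placeholder callables since the exact models are an assumption here (GFPGAN-style face restoration and Real-ESRGAN-style upscaling are common choices for this kind of pipeline); optional frame interpolation would be a separate pass over the finished video:

```python
import cv2

def enhance_video(src_path: str, dst_path: str, restore_face, upscale):
    """Two-stage restore-then-upscale pass over every frame of a video.

    restore_face and upscale are placeholder callables standing in for
    whatever restoration / super-resolution networks the pipeline loads;
    each takes and returns a single BGR numpy frame. Running two deep
    models on every frame is what makes the inference time so long.
    """
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = restore_face(frame)   # stage 1: clean up the low-res face
        frame = upscale(frame)        # stage 2: super-resolve the full frame
        if writer is None:
            h, w = frame.shape[:2]
            writer = cv2.VideoWriter(dst_path,
                                     cv2.VideoWriter_fourcc(*"mp4v"),
                                     fps, (w, h))
        writer.write(frame)
    cap.release()
    if writer is not None:
        writer.release()
```

Training native high-resolution motion and lip-sync models, as d-id presumably did, would remove this whole loop from the pipeline.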

Could be a fun project for anyone who wants to try.