fudan-generative-vision / hallo

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

Home Page: https://fudan-generative-vision.github.io/hallo/

Is this realtime?

fire17 opened this issue

Hi there,
First of all, amazing project!
I was wondering what the expected latency is for a short audio clip (2-5 seconds).
Is it instant? Less than a second?

I'm wondering whether this can be used for real-time local AI voice/video conversations.
Anything over a second is not usable in real-time user-facing applications,
but it could be good for plenty of other cases.

Let me know. It would also be nice if the answer were separated into cloud GPUs and local consumer GPUs (for local use).
Thanks and all the best!

10 minutes for 5 seconds of audio… definitely not real time, I hope latency is improved.

Which GPU is being used?

Can it be real-time if a more powerful GPU is used? @puffy310, were you running inference on your local machine?

I was not using a local GPU; I was using a rented L4 on HF through https://huggingface.co/spaces/fudan-generative-ai/hallo. It is still early tech, but I have not checked for a month, so inference time may have improved significantly.

In theory, anything can be run in real time with powerful enough hardware. I do not know the GPU threshold for running this at 8 or 12 FPS. It's likely that even 8xH100 wouldn't get close. Maybe someone from Fudan can give more insight.
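
To put the numbers from this thread in perspective, here is a rough back-of-the-envelope sketch. It uses only the figures reported above (about 10 minutes for 5 seconds of audio, and target rates of 8 or 12 FPS). The helper names are made up for illustration, the 10-minute figure is a single report from a shared hosted Space and may include queueing time, and the per-frame budget assumes frames are produced one at a time with no batching or pipelining.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Seconds of compute spent per second of audio; 1.0 means real time."""
    return processing_s / audio_s


def frame_budget_ms(fps: float) -> float:
    """Time available per frame (in ms) to sustain the given frame rate."""
    return 1000.0 / fps


# Figures reported in this thread: ~10 minutes for 5 seconds of audio.
rtf = real_time_factor(processing_s=10 * 60, audio_s=5)
print(f"real-time factor: {rtf:.0f}x slower than real time")

# Per-frame time budget for the frame rates mentioned above.
for fps in (8, 12):
    print(f"{fps} FPS -> {frame_budget_ms(fps):.1f} ms per frame")
```

Under these assumptions, the reported run would need to be roughly two orders of magnitude faster before generation keeps pace with the audio, and that is before counting any startup latency.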