rhasspy / larynx

End to end text to speech system using gruut and onnx

Ideas for lipsync and visemes?

kinoc opened this issue · comments

commented

First, love the project!

I have a robotic and virtual agent project that I'm trying to get as close to real-time response as possible.
I use the following to generate speech:
python3 fastVoice.py | larynx -v ek --interactive --ssml --raw-stream --cuda --half --max-thread-workers 8 --stdin-format lines --process-on-blank-line | aplay -r 22050 -c 1 -f S16_LE
Here fastVoice.py just dumps SSML from a socket onto stdout (remember to flush properly ...)
fastVoice.txt
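For context, the core of fastVoice.py is roughly this (a minimal sketch, not the attached file; the socket source and port are placeholders) — the important part is flushing after each blank line so larynx's --process-on-blank-line trigger fires immediately:

```python
import io

def forward_ssml(sock_file, out):
    """Forward newline-delimited SSML messages to larynx's stdin.

    Each message ends with a blank line so that larynx, run with
    --process-on-blank-line, starts synthesis right away. The flush
    is critical: buffered output silently stalls the whole pipeline.
    """
    for line in sock_file:
        out.write(line)
        if line.strip() == "":
            out.flush()  # push the completed message downstream now

if __name__ == "__main__":
    import socket, sys
    # Placeholder endpoint; the real script reads from its own socket.
    with socket.create_connection(("localhost", 9999)) as sock:
        forward_ssml(sock.makefile("r"), sys.stdout)
```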

All works very well. Audio generally starts <1s after receiving the message. The question is how to get a phoneme-viseme sequence synced with the audio output.
I can manage level-0-ish lipsync by tracking the amplitude of the audio output, but that gives enough info only for the jaw, not the visemes for the lips.
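The amplitude-based jaw tracking I mean looks roughly like this (a sketch, not my exact code; the peak normalization constant is a tuning assumption) — it maps the RMS of each raw S16_LE chunk coming out of --raw-stream to a jaw-openness value:

```python
import array

def jaw_openness(pcm_chunk: bytes, peak: float = 10000.0) -> float:
    """Map the RMS of a raw S16_LE mono PCM chunk to jaw openness in [0, 1].

    pcm_chunk: raw bytes as produced by larynx --raw-stream (16-bit LE).
    peak: RMS level treated as "mouth fully open" (tuned by ear).
    """
    samples = array.array("h")  # signed 16-bit, matches S16_LE on LE hosts
    samples.frombytes(pcm_chunk)
    if not samples:
        return 0.0
    rms = (sum(s * s for s in samples) / len(samples)) ** 0.5
    return min(rms / peak, 1.0)
```

This is exactly why it only drives the jaw: amplitude carries no information about which phoneme is being spoken, so lip shapes (rounding, closure, spread) can't be recovered from it.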

Do you have any ideas/pointers on how to keep the responsiveness of --raw-stream while also getting real-time phoneme timing info to generate the matching visemes?