Mozer / talk-llama-fast

Port of OpenAI's Whisper model in C/C++ with xtts and wav2lip


whisper.cpp

BahamutRU opened this issue · comments

Rofl?
Why not faster-whisper?
It's faster, smaller, better.

And the streaming mode is terrible, sure.
But for speed…

The software is ordinary and simple, but it's all-in-one.

Try faster-whisper. =)

How fast is it for a short phrase (e.g. "How are you?")?
I have to check, but I don't think it will be faster than the 0.22s that I managed to get with distilled whisper.cpp medium in English.
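
(For reference, one rough way to take such a short-phrase measurement is to shell out to the compiled whisper.cpp main binary from Python and read the timing summary it prints. This is only a sketch; the model and audio file names below are assumptions, not the exact setup used in this project.)

import subprocess

# Assumed paths: a ggml conversion of distil-medium.en and a short test clip.
cmd = ["./main",
       "-m", "models/ggml-distil-medium.en.bin",
       "-f", "short_phrase.wav",
       "-l", "en"]

# whisper.cpp writes the transcription to stdout; its log output,
# including the per-stage timing summary, goes to stderr.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)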

I have tested it. With default settings, faster-whisper is a little bit slower than whisper.cpp in my project for short phrases. I am getting 0.26s for faster-whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper 1.2 GB.

For long phrases faster-whisper is better. But the main usage in my project is transcribing short phrases.

Code for distilled medium English (the non-distilled model is slower and takes even more VRAM):

from faster_whisper import WhisperModel
import time

model_size = "distil-medium.en"

# Load the distilled medium English model on the GPU in fp16.
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# transcribe() returns a lazy generator; decoding actually runs while iterating over segments.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)

print(time.time())
# First timed run.
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
# Second timed run.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
# Third timed run.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
>python test.py
1713012570.1575649
[0.00s -> 2.00s]  fucking music or what?
1713012570.5819616
[0.00s -> 2.00s]  fucking music or what?
1713012570.8380058
[0.00s -> 2.00s]  fucking music or what?
1713012571.093508

There is also whisperX, which can do inference in batches, but it will use a lot of VRAM.
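
(Roughly, batched inference with whisperX looks like the sketch below, based on its documented API; the model name, batch size, and file name are illustrative, and it is the larger batches that drive the VRAM use up.)

import whisperx

device = "cuda"
batch_size = 16          # larger batches are faster but use more VRAM
compute_type = "float16"

# Load the model and run batched transcription over the whole file.
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=batch_size)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])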

0.26s for faster-whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper 1.2 GB.

Hm, I see.
You split the whole speech into small pieces for a faster reaction? And that 0.03s turns into a delay multiplied across pieces?
So the VRAM size is not a priority?

Okay, it's your business. =)

Sorry, you're right.

GL!

And about xtts: if you use streaming, it makes the quality worse. =) But there is no other option, I understand…

Thank you both for this discussion. I've added highlights to the README: dmikushin@895c324