Mozer / talk-llama-fast

Port of OpenAI's Whisper model in C/C++ with xtts and wav2lip


whisper.cpp

BahamutRU opened this issue · comments

Rofl?
Why not faster-whisper?
It's faster, smaller, better.

And the streaming mode is terrible, sure.
But for speed…

The software is ordinary and simple, but it's all-in-one.

Try faster-whisper. =)

How fast is it for a short phrase (e.g. "How are you?")?
I have to check, but I don't think it will be faster than the 0.22s that I managed to get with distilled whisper.cpp medium in English.
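
(For reference, one rough way to take such a short-phrase measurement is to shell out to the compiled whisper.cpp main binary from Python and read the timing summary it prints. This is only a sketch; the model and audio file names below are assumptions, not the exact setup used in this project.)

import subprocess

# Assumed paths: a ggml conversion of distil-medium.en and a short test clip.
cmd = ["./main",
       "-m", "models/ggml-distil-medium.en.bin",
       "-f", "short_phrase.wav",
       "-l", "en"]

# whisper.cpp writes the transcription to stdout; its log output,
# including the per-stage timing summary, goes to stderr.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)
print(result.stderr)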

I have tested it. With default settings, faster-whisper is a little bit slower than whisper.cpp in my project for short phrases. I am getting 0.26s for faster-whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper 1.2 GB.

For long phrases faster-whisper is better. But the main usage in my project is transcribing short phrases.

Code for distilled medium English (the non-distilled model is slower and takes even more VRAM):

from faster_whisper import WhisperModel
import time

model_size = "distil-medium.en"

# Load the distilled medium English model on the GPU in fp16.
model = WhisperModel(model_size, device="cuda", compute_type="float16")

# transcribe() returns a lazy generator; decoding actually runs while iterating over segments.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)

print(time.time())
# First timed run.
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
# Second timed run.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
# Third timed run.
segments, info = model.transcribe("audio.wav", beam_size=5, language="en", condition_on_previous_text=False)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
print(time.time())
>python test.py
1713012570.1575649
[0.00s -> 2.00s]  fucking music or what?
1713012570.5819616
[0.00s -> 2.00s]  fucking music or what?
1713012570.8380058
[0.00s -> 2.00s]  fucking music or what?
1713012571.093508

There is also whisperX, which can do inference in batches, but it will use a lot of VRAM.
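
(Roughly, batched inference with whisperX looks like the sketch below, based on its documented API; the model name, batch size, and file name are illustrative, and it is the larger batches that drive the VRAM use up.)

import whisperx

device = "cuda"
batch_size = 16          # larger batches are faster but use more VRAM
compute_type = "float16"

# Load the model and run batched transcription over the whole file.
model = whisperx.load_model("large-v2", device, compute_type=compute_type)
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=batch_size)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])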

0.26s for faster-whisper and 0.23s for whisper.cpp. whisper.cpp uses 2.0 GB of VRAM and faster-whisper 1.2 GB.

Hm, I see.
You split the whole speech into small pieces for a faster reaction? And that 0.03s turns into a delay multiplied across pieces?
So the VRAM size is not a priority?

Okay, it's your business. =)

Sorry, you're right.

GL!

And about xtts: if you use streaming, it makes the quality worse. =) But there is no other option, I understand…

Thank you both for this discussion. I've added highlights to the README: dmikushin@895c324