ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

batching inference and forced decoding for speedup and multi-target

Gldkslfmsd opened this issue

Batching inference should be used in Whisper-Streaming. It's currently not implemented.

This could work: huggingface/transformers#27658

  • if "forced decoding" really works for Whisper, it should help to avoid re-processing the current buffer from start of segment, and it should be faster

Why batching:

  • If more than chunk-size audio is accumulated, process a batch of two views: the full audio buffer, and the buffer minus the last chunk. Then apply local agreement as on two subsequent iterations (see the sketch after this list). It will be faster.
  • it could enable joint transcription and translation on one GPU. It might be slower than running them separately -- due to padding, one of them might have a short buffer and the other a long one. But not by much with forced decoding. And it might be worthwhile anyway
  • it could enable serving multiple clients in one instance
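
A minimal sketch of the two-views batching from the first bullet, assuming the Hugging Face Transformers backend; the model name, chunk size, and the `batched_two_views` helper are illustrative, not part of Whisper-Streaming:

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

SAMPLE_RATE = 16000
CHUNK_S = 1.0  # hypothetical chunk size in seconds

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def batched_two_views(audio_buffer: np.ndarray) -> str:
    """Decode the full buffer and the buffer minus the last chunk in one
    batch, then commit only their longest common word prefix."""
    # Assumes the buffer is longer than one chunk.
    shorter = audio_buffer[: -int(CHUNK_S * SAMPLE_RATE)]
    # The feature extractor pads both views to the same 30 s log-mel window,
    # so they can share a single batch.
    feats = processor(
        [audio_buffer, shorter], sampling_rate=SAMPLE_RATE, return_tensors="pt"
    ).input_features
    ids = model.generate(feats)  # one forward pass, batch of 2
    full, partial = processor.batch_decode(ids, skip_special_tokens=True)
    # Local agreement: keep the prefix both hypotheses agree on.
    committed = []
    for a, b in zip(full.split(), partial.split()):
        if a != b:
            break
        committed.append(a)
    return " ".join(committed)
```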

Hi! There's an implementation that supports batch inference: https://github.com/Vaibhavs10/insanely-fast-whisper
I'm not sure if it can be easily implemented in the Whisper streaming project.

> Hi! There's an implementation that supports batch inference: https://github.com/Vaibhavs10/insanely-fast-whisper
>
> I'm not sure if it can be easily implemented in the Whisper streaming project.

Yes, me neither. I would need a pointer to the function that takes two audio samples and processes them at once.

OK, I checked it. Insanely Fast Whisper is just a wrapper around Hugging Face Transformers. The example usage of batching is huggingface/transformers#27658 .
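
For reference, that wrapper's batching reduces to the Transformers ASR pipeline with a `batch_size` argument; a minimal sketch (model name, batch size, and file name are just examples):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,  # split long audio into 30 s chunks
    batch_size=8,       # decode up to 8 chunks in one forward pass
)
result = asr("long_audio.wav")
print(result["text"])
```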

Forced decoding is shown in https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py .
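
To illustrate the idea (my own rough sketch, not the linked demo's code): feed the already-committed tokens as the decoder prefix, so the model only continues the hypothesis instead of re-decoding the whole buffer. This version is deliberately simple and unoptimized (no KV cache):

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()

@torch.no_grad()
def greedy_continue(input_features, forced_token_ids, max_new_tokens=32):
    # forced_token_ids: Whisper's start/language/task tokens followed by the
    # token ids of the text committed so far.
    ids = torch.tensor([forced_token_ids])
    # Encode the audio once; reuse the encoder output for every decode step.
    enc = model.model.encoder(input_features)
    for _ in range(max_new_tokens):
        logits = model(encoder_outputs=enc, decoder_input_ids=ids).logits
        next_id = logits[0, -1].argmax()
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == model.config.eos_token_id:
            break
    return ids[0]
```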

So these are the initial points for working on this issue. I might do it in a few weeks, but anybody can go ahead :)

Any news about this implementation, or any findings, so we can try to work on it? I am trying to build a multi-client server, and batching would be nice to run more than one transcription on the same instance.

Wow, great!
No news, except that batching has also become my priority :) Let's cooperate. I want to start later this week. My first step will be a Jupyter notebook where I'll quickly inspect and prototype. It will be messy. Then I'll isolate the working solution into this repo.

The easiest use case for batching is decoding the same audio twice: the whole buffer, plus the whole buffer minus the last chunk.

Sure, let's cooperate! My doubt is: decoding the same audio twice is for the speedup use case, right? I saw you mention multi-client in #42 -- would a decoding + batching backend API be necessary to parallelize multiple audios on the GPU? I could try to work on this batching backend layer using the whisper-streaming source code.

> Sure, let's cooperate! My doubt is: decoding the same audio twice is for the speedup use case, right?

Yes. Just be aware that batching multiple audios can result in a slowdown. There will be independent audio buffers of different lengths; you need to pad the audio input to the longest one, and the processing time is the same as for the longest. So you gain efficiency, but lose some speed.
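
A toy illustration of that padding cost (pure NumPy, independent of any backend):

```python
import numpy as np

def pad_to_longest(buffers: list[np.ndarray]) -> np.ndarray:
    """Zero-pad every client buffer to the longest one so they fit one batch;
    the batch then costs as much as its longest member."""
    longest = max(len(b) for b in buffers)
    return np.stack([np.pad(b, (0, longest - len(b))) for b in buffers])

# e.g. clients with 3 s, 12 s, and 27 s of 16 kHz audio all pay for 27 s:
batch = pad_to_longest(
    [np.zeros(3 * 16000), np.zeros(12 * 16000), np.zeros(27 * 16000)]
)
print(batch.shape)  # (3, 432000)
```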

So, how's your progress, @joaogabrieljunq ?
I found this today: https://github.com/m-bain/whisperX/blob/main/whisperx/asr.py They know how to use batching with faster-whisper. I hope I can reuse this code. And I found that Hugging Face Transformers enables batching with Whisper, but most probably not with word-level timestamps. And those are really necessary for Whisper-Streaming.
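
For context, the word-level timestamps that Whisper-Streaming depends on look like this in plain (unbatched) faster-whisper, so any batched backend would have to preserve them (model size and file name are examples):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small")
segments, info = model.transcribe("audio.wav", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        # Each word carries its own start/end time in seconds.
        print(f"{word.start:.2f}-{word.end:.2f}: {word.word}")
```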

Hello again @Gldkslfmsd, nice to know that you are progressing in the batch implementation research! I also spent yesterday researching possible implementations for this. I found WhisperS2T, which seems to implement dynamic time-length support in batch inference, helping with the padding problem you mentioned above. Perhaps this could help too: https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py
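
As a generic illustration of the dynamic-length idea (my own sketch, not WhisperS2T's code): grouping buffers of similar length before batching limits each batch's padding to its own longest member:

```python
import numpy as np

def bucket_by_length(buffers: list[np.ndarray], max_batch: int = 8):
    """Yield batches of similar-length buffers, so each batch is padded
    only to its own longest member rather than to the global longest."""
    ordered = sorted(buffers, key=len)
    for i in range(0, len(ordered), max_batch):
        yield ordered[i : i + max_batch]
```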

Any news on this matter?

Any update on batching?

No, unfortunately it's not among my priorities anymore.