Live transcription PoC with the Whisper model (using the faster-whisper package) in a client-server setup, where a single server can handle multiple clients.
Sample on a MacBook Pro (M1): test-transcription-on-m1-mac.mov (🔈 sound on; faster-whisper package, base model, latency around 0.5 s)
```shell
$ pip install -r requirements.txt
$ mkdir models
$ python server.py
$ python client.py
```
There are a few parameters in each script that you can modify. This beautiful ASCII art explains how the sliding transcription window works:
- `step = 1` - seconds of new audio added in each iteration
- `length = 4` - seconds of audio kept in the transcription window
$t$ is the current time step (1 second of audio, to be precise)
```
1st second: [t, 0, 0, 0] --> "Hi"
2nd second: [t-1, t, 0, 0] --> "Hi I am"
3rd second: [t-2, t-1, t, 0] --> "Hi I am the one"
4th second: [t-3, t-2, t-1, t] --> "Hi I am the one and only Gabor"
5th second: [t, 0, 0, 0] --> "How" --> Here the process starts again, and the output goes to a new line
6th second: [t-1, t, 0, 0] --> "How are"
etc...
```
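The windowing above can be sketched in a few lines of Python. This is a hypothetical illustration of the buffering logic only, not the actual server code; the `run_windows` name and the chunk labels are made up for this example:

```python
LENGTH = 4  # seconds of audio kept in the transcription window

def run_windows(chunks, length=LENGTH):
    """Yield the chunks that would be transcribed at each second.

    `chunks` is an iterable of per-second audio chunks. Every `length`
    seconds the window is reset, mirroring the diagram above where the
    output moves to a new line.
    """
    window = []
    for i, chunk in enumerate(chunks):
        if i % length == 0:
            window = []          # restart: output goes to a new line
        window.append(chunk)
        yield list(window)

# With five 1-second chunks, the window grows for 4 seconds, then resets:
for w in run_windows(["t1", "t2", "t3", "t4", "t5"]):
    print(w)
```

Each yielded window corresponds to the audio that gets sent through the model in that iteration, which is why early transcriptions are partial and keep getting refined until the window resets.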
- Use a VAD on the client side, and either send the audio for transcription when a longer silence (e.g. 1 s) is detected, or fall back to the maximum length if there is no silence.
- Transcribe shorter timeframes to get more instant transcriptions, while larger timeframes are used in parallel to "correct" already transcribed parts (async correction).