ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

Voice Activity Controller

rodrigoGA opened this issue · comments

Hello, I found your project interesting. Good job!

I believe there is an incorrect use of VAD. The `get_speech_timestamps` function used by faster-whisper is a copy of the Silero function, which is intended for complete audio files. When streaming, however, the audio arrives in fragments. Silero already includes a utility for exactly this case: https://github.com/snakers4/silero-vad/blob/5e7ee10ee065ab2b98751dd82b28e3c6360e19aa/utils_vad.py#L428
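
For context, a minimal sketch of how that streaming utility (`VADIterator`) is typically driven chunk by chunk, following the silero-vad README; the exact supported chunk sizes depend on the model version:

```python
import torch

# Load Silero VAD and its helper utilities from torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

vad_iterator = VADIterator(model)
wav = read_audio('test.wav', sampling_rate=16000)

window_size_samples = 512  # one chunk; supported sizes depend on the model version
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    # Returns e.g. {'start': 1.2} or {'end': 3.4} at speech boundaries, else None
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict)
vad_iterator.reset_states()  # reset internal state between audio streams
```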

I have forked your project to test this: https://github.com/rodrigoGA/whisper_streaming/tree/main
Changing the way VAD is used seemed to improve the results.

One of the main drawbacks I found is the delay before the transcription arrives, which feels unpleasant, especially when the conversation ends and no transcription appears for a few seconds. I therefore created a VAD-based class that flushes the buffer once it detects that the user has not spoken for 0.5 seconds: https://github.com/rodrigoGA/whisper_streaming/blob/main/voice_activity_controller.py
Here is an example that transcribes from the microphone: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_streaming.py
It greatly improves the feeling of real-time transcription; perhaps a similar idea can be applied here. I say "feeling" because I haven't done any serious performance testing.
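
In essence, the idea is to count trailing silence and force a flush of the pending audio once it crosses the threshold. A minimal sketch of that mechanism (simplified, with hypothetical names, not the actual voice_activity_controller.py code):

```python
SAMPLING_RATE = 16000
SILENCE_FLUSH_S = 0.5  # flush once this much continuous silence is detected

class SilenceFlusher:
    def __init__(self, is_speech, on_flush):
        self.is_speech = is_speech    # callable: chunk -> bool (VAD decision per chunk)
        self.on_flush = on_flush      # callable invoked with the buffered audio
        self.buffer = []              # pending audio chunks
        self.silence_samples = 0      # consecutive non-speech samples seen so far

    def feed(self, chunk):
        self.buffer.append(chunk)
        if self.is_speech(chunk):
            self.silence_samples = 0
        else:
            self.silence_samples += len(chunk)
            if self.silence_samples >= SILENCE_FLUSH_S * SAMPLING_RATE:
                self.on_flush(self.buffer)  # force transcription of what we have
                self.buffer = []
                self.silence_samples = 0
```

Here `is_speech` would wrap the per-chunk Silero VAD decision.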

I've also created a simple example that transcribes only when the user stops talking, to compare results: https://github.com/rodrigoGA/whisper_streaming/blob/main/mic_test_whisper_simple.py

Another point I think you should consider is the tokens you are using. In languages like Spanish, questions are enclosed in question marks at the beginning and the end and can have other punctuation marks in the middle, for example: "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?" In some situations, however, your approach has transcribed it as: "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?" It might be a problem with Whisper itself, but I suspect it is the set of tokens you have applied.
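
To illustrate the kind of check involved (a hypothetical heuristic, not your actual tokenizer): in Spanish, "¿" should only open a sentence or clause, so its position can be validated:

```python
import re

good = "¿Cuál es la capital de Francia, y por qué es conocida por su arquitectura?"
bad = "cual es la capital de Francia, ¿por qué es conocida por su arquitectura?"

def misplaced_inverted_qmark(sentence: str) -> bool:
    """Flag a '¿' that does not open a sentence, i.e. follows text not ending in .!?"""
    for m in re.finditer("¿", sentence):
        prefix = sentence[:m.start()].rstrip()
        if prefix and prefix[-1] not in ".!?":
            return True
    return False

print(misplaced_inverted_qmark(good))  # False: '¿' correctly opens the sentence
print(misplaced_inverted_qmark(bad))   # True: '¿' follows a comma mid-sentence
```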

Wow, thank you, @rodrigoGA! This is very interesting feedback. I want to review and test your approach and possibly merge the useful parts later, when I have time.
Thanks!

If the suggestion is integrated, I would also suggest changing the way the transcription is returned. All streaming systems indicate in some way whether a result is partial or final. That way, what is in the buffer could be returned as partial, and the user would get more realistic feedback on what is being said, with the understanding that a partial result can change.

Yes, an option for |||-separated partial output is possible. But I don't want a more complicated output protocol; plaintext is enough.
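
For instance, a sketch of what such a line could look like (hypothetical framing, not a committed format):

```python
# Plaintext framing: committed text before '|||', volatile buffer after it.
def format_update(committed: str, buffered: str) -> str:
    return f"{committed} ||| {buffered}"

print(format_update("This is the confirmed transcript.", "and this part may still chan"))
# This is the confirmed transcript. ||| and this part may still chan
```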

I understand the idea of keeping it simple. However, this is the standard in streaming ASR. You can see how NVIDIA uses `is_final` for all streaming models supported by the Riva platform (https://docs.nvidia.com/deeplearning/riva/user-guide/docs/reference/protos/protos.html#_CPPv428SpeechRecognitionAlternative), and companies that sell the model as a service do the same in their streaming APIs (https://www.assemblyai.com/docs/guides/real-time-streaming-transcription).
All of them use the same concept. As a consumer of these services, I can tell you that it is very useful for knowing when the user is speaking and for getting feedback on what is happening, even before the transcription is finished. Imagine using ASR in a real-world use case, for example transcribing a phone call: you need to know when the user has stopped speaking and the transcription is final in order to do something with the text. Otherwise, you would have to wait until the call ends to consider the transcription complete, which loses the real-time aspect.
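
To make the pattern concrete, here is a sketch of a consumer-side handler (hypothetical types, loosely modeled on the APIs linked above, not your protocol):

```python
from dataclasses import dataclass

@dataclass
class TranscriptUpdate:
    text: str
    is_final: bool

def handle(update: TranscriptUpdate, committed: list[str]) -> None:
    if update.is_final:
        committed.append(update.text)   # stable: safe to act on (logging, NLU, ...)
        print("FINAL  :", update.text)
    else:
        print("PARTIAL:", update.text)  # volatile: display only, may still change

committed: list[str] = []
handle(TranscriptUpdate("what is the", False), committed)
handle(TranscriptUpdate("what is the capital of France?", True), committed)
```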

@rodrigoGA, thank you very much again. I integrated your VAC in https://github.com/ufal/whisper_streaming/tree/vad-streaming and it seems to work well, but the code needs to be reviewed and made clearer and simpler. Then I can merge it.