ufal / whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

How did you measure computationally unaware latency?

lifefeel opened this issue

This question is not about a code issue.

I read your paper and would like to know more specifically about computationally unaware latency.

In your code, the computationally unaware mode does not measure latency. How did you measure it?

I would like to reproduce your results.

Thanks.

I have separate code for it. It's described in the paper: WER-align the words in the hypothesis and the gold transcript, and then take the difference of their timestamps.
Do you need the code? I think I can publish it.
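
In code terms, the alignment step is roughly this (a minimal sketch of standard edit-distance word alignment, not the published script; all names here are mine):

def align(gold, hyp):
    """Levenshtein-align two word lists; return (gold_i, hyp_j) pairs of exact matches."""
    n, m = len(gold), len(hyp)
    # dp[i][j] = edit distance between gold[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (gold[i - 1] != hyp[j - 1]),
                           dp[i - 1][j] + 1,   # word missing from hypothesis
                           dp[i][j - 1] + 1)   # word inserted in hypothesis
    # backtrace, keeping only exactly matched word pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + (gold[i - 1] != hyp[j - 1]):
            if gold[i - 1] == hyp[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

The per-word latency is then the difference between the emission time of the hypothesis word and a gold timestamp of its aligned counterpart (e.g. the word's end time), averaged over all matched words.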

Yes, it would be very helpful for reproducing the same results.

So, here it is:

lat.zip

Simple usage:

python3 latency.py $doc/orto+ts $asr $lan > $asr.lat

  • $doc/orto+ts is the gold transcript, one word per line with its start and end timestamps in milliseconds, in a format like this (a parsing sketch follows this list):
3347 3497 Herr
3497 4227 Präsident,
4227 4957 Herr
4957 5337 Kommissar,
5337 5627 verehrte
5627 6117 Kollegen,
6117 6237 die
6237 7177 Diskussion,
  • $asr is the real-time ASR hypothesis file, the output of Whisper-Streaming, in a format like this:
9594.1408 0 5540 Herr Präsident, Herr Kommissar, verehrte
11769.4223 5560 7940 Kollegen, die Diskussion, die wir
13601.9289 7940 8480 jetzt führen,
16274.7409 8960 12180 zieht darauf ab, Tierversuche zu verringern und
18144.7799 12180 14200 die Bedingungen zu verbessern.

The first number on each line is the emission time of the line in milliseconds. The other two numbers are ignored.

  • $lan is not a file, but a variable containing the language code for MosesTokenizer, e.g. "en" or "de"

  • $asr.lat is the output file that is created; it contains e.g. 5810.91, the average word latency in milliseconds
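
For completeness, here is a rough sketch of how such a script could wire the pieces together, reusing align from the sketch above (hypothetical code, not the contents of lat.zip; it assumes the sacremoses package for MosesTokenizer, and file names are placeholders):

from sacremoses import MosesTokenizer

def read_gold(path, lang):
    """Gold transcript: 'start_ms end_ms word' per line; tokenize to split off punctuation."""
    tok = MosesTokenizer(lang=lang)
    words, ends = [], []
    with open(path) as f:
        for line in f:
            _start, end, word = line.split(maxsplit=2)
            for w in tok.tokenize(word.strip()):
                words.append(w)
                ends.append(int(end))   # each sub-token keeps the word's end time
    return words, ends

def read_asr(path, lang):
    """ASR hypothesis: 'emission_ms beg_ms end_ms text...' per line.
    Every word on a line gets that line's emission time."""
    tok = MosesTokenizer(lang=lang)
    words, emits = [], []
    with open(path) as f:
        for line in f:
            emit, _beg, _end, text = line.split(maxsplit=3)
            for w in tok.tokenize(text.strip()):
                words.append(w)
                emits.append(float(emit))
    return words, emits

gold_words, gold_ends = read_gold("orto+ts", "de")
hyp_words, hyp_emits = read_asr("asr.out", "de")
lats = [hyp_emits[j] - gold_ends[i] for i, j in align(gold_words, hyp_words)]
print(round(sum(lats) / len(lats), 2))   # average word latency in milliseconds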

I hope it helps. Sorry for the messy scripts; they haven't been cleaned up for publication.

Please reopen if you have a question or an issue.