SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2

Timestamps precision in milliseconds?

mirix opened this issue

commented

Hello,

I am using the sample code provided:

```python
from faster_whisper import WhisperModel

model_size = 'large-v2'
model = WhisperModel(model_size, device='cpu', compute_type='int8')

segments, info = model.transcribe('Michael, Jim, Dwight epic scene [qHrN5Mf5sgo].mp3', beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print('[%.2fs -> %.2fs] %s' % (segment.start, segment.end, segment.text))
```

And the timestamp precision seems to be one second:

```
Detected language 'en' with probability 0.988083
[0.00s -> 7.00s]  Here's what's going to happen. I am going to have to fix you, manage you to, on a more
[7.00s -> 13.00s]  personal scale, a more micro form of management. Jim, what is that called?
[13.00s -> 14.00s]  Micro Jimin.
[14.00s -> 19.00s]  Boom. Yes. Now Jim is going to be the client. Dwight, you're going to have to sell to him
[19.00s -> 24.00s]  without being aggressive, hostile, or difficult. Let's go.
[24.00s -> 28.00s]  All right, fine. Ring, ring.
[28.00s -> 29.00s]  Hello?
```

Would it be possible to report milliseconds?

Another, unrelated, question: if I wished to perform an analysis per segment (say, gender, sentiment, emotion), how should I use the segment object?
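For context, here is a rough sketch of the kind of per-segment slicing I have in mind, assuming pydub is available (the actual classifiers are left out):

```python
from pydub import AudioSegment

audio = AudioSegment.from_file('Michael, Jim, Dwight epic scene [qHrN5Mf5sgo].mp3')

# Note: transcribe() returns a generator, so store the segments in a list
# before iterating over them more than once.
for i, segment in enumerate(segments):
    # Each Segment carries .start/.end in seconds and the transcribed .text;
    # pydub slices in milliseconds.
    clip = audio[int(segment.start * 1000):int(segment.end * 1000)]
    clip.export('segment_%03d.wav' % i, format='wav')
    # ...run the gender/sentiment/emotion analysis on the clip and segment.text
```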

Furthermore, I have tried numerous approaches to speaker diarization (I could not try the NeMo-based ones because I do not have an adequate GPU), and all of them yield very bad results in certain scenarios when it comes to speaker attribution. I am considering a brute-force approach: any recommendations for a library I could use to compare a segment with the previous one in order to determine whether or not it is the same speaker?
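To make the brute-force idea concrete, here is a rough sketch of what I have in mind, assuming the resemblyzer package for speaker embeddings (the 0.75 similarity threshold is just a guess and would need tuning; very short segments may also need padding):

```python
import librosa
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
# Load at 16 kHz so segment times map directly to sample indices.
wav, sr = librosa.load('Michael, Jim, Dwight epic scene [qHrN5Mf5sgo].mp3', sr=16000)

def segment_embedding(start, end):
    clip = wav[int(start * sr):int(end * sr)]
    # preprocess_wav normalizes volume and trims silence within the clip.
    return encoder.embed_utterance(preprocess_wav(clip, source_sr=sr))

prev_emb = None
for segment in segments:  # segments from model.transcribe(...)
    emb = segment_embedding(segment.start, segment.end)
    if prev_emb is not None:
        # resemblyzer embeddings are L2-normalized, so the dot product
        # is the cosine similarity.
        same_speaker = float(np.dot(prev_emb, emb)) > 0.75
        print('same speaker as previous: %s | %s' % (same_speaker, segment.text))
    prev_emb = emb
```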

Best,

Ed

Setting word_timestamps=True causes segment timestamps to be reported with millisecond precision. Even if you don't need the word timestamps themselves, enabling this option makes the timestamps of all segments more precise.
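For example, something along these lines (with a placeholder audio path) prints both the refined segment boundaries and the per-word times:

```python
from faster_whisper import WhisperModel

model = WhisperModel('large-v2', device='cpu', compute_type='int8')

# word_timestamps=True enables the word-level alignment pass, which also
# refines segment boundaries beyond whole seconds.
segments, info = model.transcribe('audio.mp3', beam_size=5, word_timestamps=True)

for segment in segments:
    # Three decimals to make the extra precision visible.
    print('[%.3fs -> %.3fs] %s' % (segment.start, segment.end, segment.text))
    for word in segment.words:
        print('  [%.3fs -> %.3fs] %s' % (word.start, word.end, word.word))
```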

For diarization you can try the method implemented in the repo below.

https://github.com/JaesungHuh/SimpleDiarization

Even without word timestamps, the Whisper model can predict timestamps with a precision of 10 milliseconds, but one of the Whisper authors said that "the predicted timestamps tend to be biased towards integers" (source).

commented

> Setting word_timestamps=True causes segment timestamps to be reported with millisecond precision. Even if you don't need the word timestamps themselves, enabling this option makes the timestamps of all segments more precise.
>
> For diarization you can try the method implemented in the repo below.
>
> https://github.com/JaesungHuh/SimpleDiarization

Thanks for the tips. Indeed, adding the word_timestamps keyword produces a precision of 10 milliseconds.

I tried the library you suggested, but it seems it does not work for more than two speakers. Or perhaps I am wrong. We will see:

JaesungHuh/SimpleDiarization#1

I have tried many diarization strategies, but, so far, everything based upon pyannote fails:

pyannote/pyannote-audio#1406