Timestamps are too tight when repetition_penalty is present
Brodski opened this issue
Title says it all. Is there something that could be done to make the timestamps more reasonable, so they don't break up mid-sentence?
Here is my code, with a couple of comparisons after it.
```python
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_size_insane,  # large-v3
    torch_dtype=torch.float16,
    device=my_device,
    model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
)
generate_kwargs = {
    "language": "en",
    "temperature": 0.2,
    "repetition_penalty": 3.0,
    "task": "transcribe",
}
outputs = pipe(
    filename,
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
    generate_kwargs=generate_kwargs,
)
return outputs
```
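For context on why 3.0 is so aggressive: `repetition_penalty` rescales the logits of tokens that were already generated. A minimal sketch of the standard rule (as in `transformers`' `RepetitionPenaltyLogitsProcessor`; plain Python lists here instead of tensors for illustration):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Rescale logits of already-generated tokens (CTRL-style rule).

    Positive logits are divided by the penalty and negative ones are
    multiplied by it, so seen tokens always get less likely for penalty > 1.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# With penalty=3.0, a confident logit of 6.0 collapses to 2.0:
print(apply_repetition_penalty([6.0, -1.0, 0.5], [0, 1], 3.0))  # [2.0, -3.0, 0.5]
```

Since Whisper emits timestamp tokens in the same vocabulary as text, penalizing repeats this hard also discourages re-emitting nearby timestamp tokens, which may explain the degenerate segments below.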
With a little formatting, here is the output of a transcribed section. As you can see, over roughly seven seconds the output creates a timestamp for almost every word while the speaker was talking slowly:
00:11:29,440 --> 00:11:29,460: To figure out the words you kno?
00:11:29,919 --> 00:11:29,940: Cry..
00:11:32,759 --> 00:11:32,779: Im crying a little bit
00:11:33,879 --> 00:11:33,899: But im can' t
00:11:37,559 --> 00:11:37,580: ...
00:11:37,919 --> 00:11:37,940: Uh
00:11:38,399 --> 00:11:38,419: Don''t
00:11:38,519 --> 00:11:38,539: You
00:11:39,980 --> 00:11:40,000: Know
00:11:40,159 --> 00:11:40,179: Dont
00:11:40,519 --> 00:11:40,539: Dare
00:11:41,080 --> 00:11:41,100: Compliment
00:11:41,259 --> 00:11:41,279: Me
00:11:41,679 --> 00:11:41,700: Ankle
00:11:42,179 --> 00:11:42,200: Man
00:11:42,440 --> 00:11:42,460: Your
00:11:43,440 --> 00:11:43,460: Seriously
00:11:43,639 --> 00:11:43,659: One of
00:11:43,720 --> 00:11:43,740: The
00:11:44,120 --> 00:11:44,139: Nicest
00:11:44,480 --> 00:11:44,500: People
00:11:44,620 --> 00:11:44,639: That
00:11:44,799 --> 00:11:44,820: Ever
00:11:45,080 --> 00:11:45,360: Met Straight Up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met
00:11:46,620 --> 00:11:47,120: Like straight up
00:11:47,620 --> 00:11:47,720: Straight Up
But if I run the same audio without repetition_penalty, the timestamps are more reasonable:
00:11:25,220 --> 00:11:28,279: It's hard to figure out the words you know?
00:11:29,379 --> 00:11:29,980: Crying
00:11:29,980 --> 00:11:32,759: I'm crying a little bit
00:11:32,759 --> 00:11:33,879: But i can't
00:11:33,879 --> 00:11:34,279: Like
00:11:34,279 --> 00:11:37,840: Uh
00:11:37,840 --> 00:11:38,440: Don' t
00:11:38,440 --> 00:11:41,279: You dare compliment me
00:11:41,279 --> 00:11:41,960: Ankleman
00:11:41,960 --> 00:11:44,440: You're seriously one of nicest people
00:11:44,440 --> 00:11:45,080: Straight up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met.
00:11:45,580 --> 00:11:46,580: Like straight up.
00:11:47,080 --> 00:11:47,539: Straight up.
00:11:47,600 --> 00:11:47,840: Ankle,
00:11:47,940 --> 00:11:48,419: you are
00:11:48,419 --> 00:11:49,980: like
00:11:49,980 --> 00:11:51,399: One of the nicest dudes
00:11:51,399 --> 00:11:53,840: that I have ever randomly talked to on the internet
It might be nice to have something like condition_on_previous_text=False and/or vad_filter=True. I was using those in other repos, like faster-whisper, and their output, though much, much slower, was kinda better:
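For reference, the faster-whisper call I'm comparing against looks roughly like this (a sketch, assuming the faster-whisper package is installed; the file path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _info = model.transcribe(
    "audio.wav",                       # placeholder path
    language="en",
    vad_filter=True,                   # Silero VAD drops non-speech stretches
    condition_on_previous_text=False,  # don't prompt with the previous window
)
for seg in segments:
    print(f"{seg.start:.2f} --> {seg.end:.2f}:{seg.text}")
```

Both options guard against the decoder looping on its own earlier output, which is the same failure mode a repetition penalty tries to patch over.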
00:11:25,220 --> 00:11:28,260: It's hard to, like, figure out the words, you know?
00:11:29,360 --> 00:11:29,800: Cry.
00:11:29,980 --> 00:11:30,660: I'm, um...
00:11:30,660 --> 00:11:33,020: I'm crying a little bit, man.
00:11:33,080 --> 00:11:34,980: But I can't, like...
00:11:35,540 --> 00:11:35,980: I can't...
00:11:37,300 --> 00:11:39,920: I don't, you know...
00:11:39,920 --> 00:11:41,240: Don't you dare compliment me.
00:11:41,340 --> 00:11:45,080: Ankle, man, you're, like, seriously one of the nicest people I've ever met.
00:11:45,560 --> 00:11:46,580: Like, straight up.
00:11:47,080 --> 00:11:47,540: Straight up.
So, not a perfect solution, but these configs seem to work pretty well. Based on #115 (which said a value < 30 s helps), I changed chunk_length_s to 16. I then experimented and found that "repetition_penalty": 1.25 worked well.
```python
generate_kwargs = {
    "language": "en",
    "repetition_penalty": 1.25,  # this helps
    "task": "transcribe",
}
outputs = pipe(
    filename,
    chunk_length_s=16,  # this helps too
    batch_size=24,
    return_timestamps=True,
    generate_kwargs=generate_kwargs,
)
return outputs
```
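In case it's useful, here's how I turn the pipeline output into the SRT-style lines shown above (a sketch; the `chunks` shape here is what the transformers ASR pipeline returns with return_timestamps=True, with one hard-coded chunk standing in for real output):

```python
def fmt(t):
    """Convert seconds to an 'HH:MM:SS,mmm' SRT timestamp."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Stand-in for outputs["chunks"]: each chunk has a (start, end) tuple and text.
chunks = [{"timestamp": (685.22, 688.28), "text": " It's hard to figure out the words you know?"}]
for c in chunks:
    start, end = c["timestamp"]
    print(f"{fmt(start)} --> {fmt(end)}: {c['text'].strip()}")
```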
Using:
- 5 hours of audio
- large-v3
- RTX A5000 (thanks vast.ai):

chunk_length_s = 30 ---> transcribe time 3.5 min
chunk_length_s = 16 ---> transcribe time 4.36 min
Y'all can close this if you want. I'm content with this fix.