Vaibhavs10 / insanely-fast-whisper


Timestamps are too tight when repetition_penalty is present

Brodski opened this issue

Title says it all. Is there something that could be done to make the timestamps more reasonable so they don't break up mid-sentence?

Here is my code, with a couple of comparisons after it.

    import torch
    from transformers import pipeline
    from transformers.utils import is_flash_attn_2_available

    # model_size_insane and my_device are defined elsewhere in my script;
    # the defaults here are just placeholders so the snippet runs standalone.
    def transcribe(filename, model_size_insane="openai/whisper-large-v3", my_device="cuda:0"):
        pipe = pipeline(
            "automatic-speech-recognition",
            model=model_size_insane,  # large-v3
            torch_dtype=torch.float16,
            device=my_device,
            model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
        )
        generate_kwargs = {
            "language": "en",
            "temperature": 0.2,
            "repetition_penalty": 3.0,
            "task": "transcribe",
        }
        outputs = pipe(
            filename,
            chunk_length_s=30,
            batch_size=24,
            return_timestamps=True,
            generate_kwargs=generate_kwargs,
        )
        return outputs
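
The SRT-style lines in the comparisons below come from a bit of post-formatting. A minimal sketch of that step (the helper names are mine, not from the repo; it assumes the standard pipeline output shape, a dict with a "chunks" list of `{"timestamp": (start, end), "text": ...}` entries):

    # Rough sketch of the "little formatting" step; converts the pipeline's
    # chunks into SRT-style "start --> end: text" lines.
    def to_srt_time(seconds):
        # 689.44 -> "00:11:29,440"
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    def format_chunks(outputs):
        lines = []
        for chunk in outputs["chunks"]:
            start, end = chunk["timestamp"]
            end = start if end is None else end  # the final chunk can have end=None
            lines.append(f"{to_srt_time(start)} --> {to_srt_time(end)}: {chunk['text'].strip()}")
        return "\n".join(lines)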

With that formatting, here is the output of a transcribed section. As you can see, over a span of about 7 seconds the output creates a separate timestamp for almost every word when the speaker was talking slowly, and each segment covers only ~20 ms:

00:11:29,440 --> 00:11:29,460: To figure out the words you kno?
00:11:29,919 --> 00:11:29,940: Cry..
00:11:32,759 --> 00:11:32,779: Im crying a little bit
00:11:33,879 --> 00:11:33,899: But im can' t
00:11:37,559 --> 00:11:37,580: ...
00:11:37,919 --> 00:11:37,940: Uh
00:11:38,399 --> 00:11:38,419: Don''t
00:11:38,519 --> 00:11:38,539: You
00:11:39,980 --> 00:11:40,000: Know
00:11:40,159 --> 00:11:40,179: Dont
00:11:40,519 --> 00:11:40,539: Dare
00:11:41,080 --> 00:11:41,100: Compliment
00:11:41,259 --> 00:11:41,279: Me
00:11:41,679 --> 00:11:41,700: Ankle
00:11:42,179 --> 00:11:42,200: Man
00:11:42,440 --> 00:11:42,460: Your
00:11:43,440 --> 00:11:43,460: Seriously
00:11:43,639 --> 00:11:43,659: One of
00:11:43,720 --> 00:11:43,740: The
00:11:44,120 --> 00:11:44,139: Nicest
00:11:44,480 --> 00:11:44,500: People
00:11:44,620 --> 00:11:44,639: That
00:11:44,799 --> 00:11:44,820: Ever
00:11:45,080 --> 00:11:45,360: Met Straight Up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met
00:11:46,620 --> 00:11:47,120: Like straight up
00:11:47,620 --> 00:11:47,720: Straight Up

But if I run the same code without repetition_penalty, the timestamps are more reasonable:

00:11:25,220 --> 00:11:28,279: It's hard to figure out the words you know?
00:11:29,379 --> 00:11:29,980: Crying
00:11:29,980 --> 00:11:32,759: I'm crying a little bit
00:11:32,759 --> 00:11:33,879: But i can't
00:11:33,879 --> 00:11:34,279: Like
00:11:34,279 --> 00:11:37,840: Uh
00:11:37,840 --> 00:11:38,440: Don' t
00:11:38,440 --> 00:11:41,279: You dare compliment me
00:11:41,279 --> 00:11:41,960: Ankleman
00:11:41,960 --> 00:11:44,440: You're seriously one of nicest people
00:11:44,440 --> 00:11:45,080: Straight up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met.
00:11:45,580 --> 00:11:46,580: Like straight up.
00:11:47,080 --> 00:11:47,539: Straight up.
00:11:47,600 --> 00:11:47,840: Ankle,
00:11:47,940 --> 00:11:48,419: you are
00:11:48,419 --> 00:11:49,980: like
00:11:49,980 --> 00:11:51,399: One of the nicest dudes
00:11:51,399 --> 00:11:53,840: that I have ever randomly talked to on the internet

It might be nice to have something like condition_on_previous_text=False and/or vad_filter=True. I was using those options in other repos, like faster-whisper, and their output, though much, much slower, was kinda better (a sketch of that call follows the output):

00:11:25,220 --> 00:11:28,260: It's hard to, like, figure out the words, you know?
00:11:29,360 --> 00:11:29,800: Cry.
00:11:29,980 --> 00:11:30,660: I'm, um...
00:11:30,660 --> 00:11:33,020: I'm crying a little bit, man.
00:11:33,080 --> 00:11:34,980: But I can't, like...
00:11:35,540 --> 00:11:35,980: I can't...
00:11:37,300 --> 00:11:39,920: I don't, you know...
00:11:39,920 --> 00:11:41,240: Don't you dare compliment me.
00:11:41,340 --> 00:11:45,080: Ankle, man, you're, like, seriously one of the nicest people I've ever met.
00:11:45,560 --> 00:11:46,580: Like, straight up.
00:11:47,080 --> 00:11:47,540: Straight up.
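
For reference, a faster-whisper call with those two options looks roughly like this (a sketch; the model, device, and precision here are my assumptions, not settings recorded in this issue):

    # Sketch of the faster-whisper run (model/device/precision are assumptions).
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        filename,
        language="en",
        vad_filter=True,                   # drop non-speech via the built-in VAD
        condition_on_previous_text=False,  # don't condition on the previous window
    )
    for seg in segments:
        # to_srt_time is the helper from the formatting sketch above
        print(f"{to_srt_time(seg.start)} --> {to_srt_time(seg.end)}: {seg.text.strip()}")

The VAD pass and the lack of conditioning on previous text likely explain both the slower runtime and the cleaner segment boundaries.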

So it's not a perfect solution, but these configs seem to work pretty well. Based on #115 (which said a value < 30 s will help), I changed chunk_length_s to 16. I then experimented and found `"repetition_penalty": 1.25` worked well:

    # Inside the same transcribe() function as above, with the tweaked settings:
    generate_kwargs = {
        "language": "en",
        "repetition_penalty": 1.25,  # this helps
        "task": "transcribe",
    }
    outputs = pipe(
        filename,
        chunk_length_s=16,  # this helps too
        batch_size=24,
        return_timestamps=True,
        generate_kwargs=generate_kwargs,
    )
    return outputs

Using:

  • 5 hours of audio
  • large-v3
  • RTX A5000 (thanks vast.ai):

chunk_length_s = 30 ---> transcribe time 3.5 min
chunk_length_s = 16 ---> transcribe time 4.36 min
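
(Those are just wall-clock numbers, measured roughly like this, where `transcribe` is the hypothetical helper from the first snippet:)

    import time

    t0 = time.perf_counter()
    outputs = transcribe(filename)  # hypothetical helper from the first snippet
    print(f"Transcribe time: {(time.perf_counter() - t0) / 60:.2f} min")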

Y'all can close this if you want. I'm content with this fix.