m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

V3 sentence segment issue

seset opened this issue · comments

commented

V3 is now incredibly fast, maybe dozens of times faster,
but the subtitles for each paragraph are now too long. Examples below:
1
00:00:00,730 --> 00:00:26,190
Are you nervous? It's a good nervous, a happy nervous. Yeah, Matt's an incredible man, and it's obvious he's very much in love with you. I know. He's a little bit nervous too, you know, but he's holding up great. I'm glad. Yeah. Okay, well, I'm gonna go downstairs and work on those decorations. If you need anything, let me know. Love you, Mom. Love you, sweetheart. Oh, thank you.

2
00:00:34,973 --> 00:01:02,883
Oh, Jesus Christ. I thought you'd never leave. You need to get out of here, Ben. Like, right now. She could have caught us. But she didn't. And we weren't doing anything anyway. And who cares even if we were? I care? Okay, I'm not falling for one of your I wanna be with you routines. You only want me because you can't have me. Yeah, and you're only resisting because you're too preoccupied with being a petty rule follower. I am not.

Add the following

--max_line_width 42 --max_line_count 2
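
For example, a full command using these options might look like this (the input filename here is just a placeholder):

whisperx input.mp4 --max_line_width 42 --max_line_count 2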

commented

Add the following

--max_line_width 42 --max_line_count 2

Thanks, it does improve things somehow... but when I try a Japanese or Chinese source file, there are a lot of repetitive words/letters/characters in the srt and vtt output, while the json and txt content are correct. A Chinese example mixed with English is below to make it easier to understand what went wrong. Since the JSON file is correct, hopefully this minor error can be fixed soon :)

correct content (json and txt): -----------------------------------------------------

[JSON]
{
  "segments": [
    {
      "start": 0.6180930656934307,
      "end": 22.254860401459855,
      "text": "如何巧妙的去使用多个不同的ControlNet并用到Stable Diffusion的一些最新插件去精准的控制画面 生成一些这样的图片我研究了一整个星期 尝试了MultiControlNet不同的排列组合调整了很多不同的参数 把坑都踩过了一遍给你们整理了12种高阶用法的合计和所需要的所有参数",
      "words": [...]
    }
  ]
}

wrong and repetitive (srt,vtt):------------------------------------------------------
1
00:00:00,618 --> 00:00:22,235
如何 何巧 巧妙 妙的 的去 去使 使用 用多 多个 个不 不同 同的 的C Co on nt tr ro ol lN Ne et t并 并用 用到 到S St ta ab bl le e Di if ff fu us si io on n的 的一 一些 些最 最新 新插 插件 件去 去精 精准 准的 的控 控制 制画 画面 面 生成 成一 一些 些这 这样 样的 的图 图片 片我 我研 研究 究了 了一 一整 整个 个星 星期 期 尝试 试了 了M Mu ul lt ti iC Co on nt tr ro ol lN Ne et t不 不同 同的 的排 排列 列组 组合 合调 调整 整了 了很 很多 多不 不同 同的 的参 参数 数 把坑 坑都 都踩 踩过 过了 了一 一遍 遍给 给你 你们 们整 整理 理了 了12 2种高 高阶 阶用 用法 法的 的合 合计 计和 和所 所需 需要 要的 的所 所有 有参 参数 数

Add the following

--max_line_width 42 --max_line_count 2

this works for me, thanks 🙏

commented

Although these options help somewhat, I'd definitely say v3 produces worse subtitle formatting than the older version: it often breaks single words off sentences, because there's no obvious way to find the perfect max line width, and it joins unrelated chunks, which just worked before. Do you see any way to improve this? Here's an example:

Old version:

~~ Transcribing VAD chunk: (00:05.324 --> 00:34.045) ~~
[00:00.000 --> 00:01.840] Ik heb jullie zeer. Spreek voor jezelf, hè.
[00:01.920 --> 00:03.880] Attention, please. This is Lancelot.
[00:03.960 --> 00:06.160] Clap, clap. Switch seats.
[00:06.240 --> 00:07.920] Keep quets. Perfect.
[00:09.800 --> 00:12.680] Zoek ik al een boek over België. Ik ben de weg kwijt.
[00:17.080 --> 00:19.480] Ik ben het de man.
[00:19.560 --> 00:21.080] Je ziet het rare, jongen.
[00:23.360 --> 00:24.480] Hallo. Hallo.
[00:24.560 --> 00:25.840] Jullie zijn met een bus aan het rijden.
[00:25.920 --> 00:26.640] Kijk, ja.
[00:26.720 --> 00:28.120] Op die bus in een boom.
[00:28.200 --> 00:29.200] Wat?

New version:

0:00:05.40,0:00:06.82: Ik heb jullie zeer. Spreek voor jezelf,
0:00:06.84,0:00:09.28: hè. Attention, please. This is Lancelot.
0:00:09.34,0:00:12.66: Clap, clap. Switch seats. Keep quets.
0:00:12.70,0:00:16.73: Perfect. Zoek ik al een boek over België.
0:00:16.81,0:00:22.51: Ik ben de weg kwijt.
0:00:22.55,0:00:25.71: Ik ben het de man. Je ziet het raar,
0:00:25.75,0:00:29.36: jongen. Hallo.
0:00:29.40,0:00:31.30: Hallo. Jullie zijn met een bus aan het
0:00:31.44,0:00:34.00: rijden? Ja. Op die bus in een boom? Wat?

Yes, thanks for reporting, I've found similar. Unfortunately the natural segments from whisper cannot be extracted with the current batched method.

Definitely the logic for post-processing the 30s chunks into segments needs to be improved. I would suggest the following:

  1. Use the nltk toolkit's nltk.sent_tokenize to tokenize the text into sentences (create a segment for each sentence), roughly as in the sketch below.
  2. For long sentences, these can be further broken at comma locations.
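
A minimal sketch of that sentence-level splitting (not the actual whisperX code; the segment dict layout is assumed from the transcription output shown elsewhere in this thread, and timestamps are spread proportionally to sentence length since word timings are ignored here):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def split_segment_into_sentences(segment):
    # segment is assumed to be {"text": str, "start": float, "end": float}
    sentences = nltk.sent_tokenize(segment["text"])
    total_chars = sum(len(s) for s in sentences) or 1
    duration = segment["end"] - segment["start"]
    out, cursor = [], segment["start"]
    for s in sentences:
        span = duration * len(s) / total_chars  # crude proportional split
        out.append({"text": s.strip(), "start": cursor, "end": cursor + span})
        cursor += span
    return out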

Unfortunately my spare time is going into improving diarization right now, but feel free to send a pull request with improvements to this logic.

commented

Anyone working on a better segmentation right now? Otherwise I'd take a look at it.

So actually I'm on it at the moment, as I need sentence segments for diarization (the alignment logic also needed cleaning up); will push in an hour or so.

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

commented

Thanks a lot @m-bain, it's much improved. No broken sentences, though I do see the short segments appearing. That could probably be fixed by combining those segments as a post-processing step (see the sketch further below); it doesn't necessarily need to be in WhisperX.

Here are the results of the same file on the new version:

00:00:05,404 --> 00:00:06,165 Ik heb jullie zeer.
00:00:06,165 --> 00:00:06,985 Spreek voor jezelf, hè.
00:00:06,985 --> 00:00:08,246 Attention, please.
00:00:08,246 --> 00:00:09,347 This is Lancelot.
00:00:09,347 --> 00:00:10,228 Clap, clap.
00:00:10,228 --> 00:00:11,689 Switch seats.
00:00:11,689 --> 00:00:12,709 Keep quets.
00:00:12,709 --> 00:00:15,291 Perfect.
00:00:15,291 --> 00:00:16,812 Zoek ik al een boek over België.
00:00:16,812 --> 00:00:22,557 Ik ben de weg kwijt.
00:00:22,557 --> 00:00:24,978 Ik ben het de man.
00:00:24,978 --> 00:00:26,139 Je ziet het raar, jongen.
00:00:26,139 --> 00:00:29,402 Hallo.
00:00:29,402 --> 00:00:29,782 Hallo.
00:00:29,782 --> 00:00:31,803 Jullie zijn met een bus aan het rijden?
00:00:31,803 --> 00:00:32,064 Ja.
00:00:32,064 --> 00:00:33,685 Op die bus in een boom?
00:00:33,685 --> 00:00:34,005 Wat?

The biggest issue I see now is that each subtitle's end time appears to be set to the start of the next one, even when this isn't accurate.

e.g. The "Perfect" line above ended at 13.244s on the old version, which is accurate, while on v3 it stays on-screen over two seconds longer, until 15.291s.
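
A minimal sketch of the short-segment merging mentioned above (not part of whisperX; the segment dict layout is assumed from the transcription output in this thread):

def merge_short_segments(segments, min_duration=1.5, max_gap=0.5):
    # Greedily fold a segment into the previous one when it is shorter than
    # min_duration seconds and starts within max_gap seconds of it.
    merged = []
    for seg in segments:
        if (merged
                and seg["end"] - seg["start"] < min_duration
                and seg["start"] - merged[-1]["end"] <= max_gap):
            merged[-1]["text"] = merged[-1]["text"].rstrip() + " " + seg["text"].lstrip()
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged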

I still have the same problem even in the newest version (tried on ja and zh). Everything comes out in huge chunks.
Python 3.10.11
whisperx --model medium --language ja --compute_type int8 filename.ext

The process goes from
Performing transcription...
right into
Performing alignment...
without displaying any timestamps.

commented

Same here. V3.1. Sentences are too long...

commented

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

Thanks for the massive code improvement! After updating, I found the following:

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.
  2. For fr, ja and zh, there is still not much improvement in segment length. Especially for ja and zh, with these kinds of syllabary/ideographic characters, I noticed the updated code for eliminating repetitive characters, but the extra space in between makes the ja and zh segments display even longer, so I use --no_align for ja and zh for now.
  3. When I try to use diarization to see whether it helps the segments, I get the errors below; I haven't had much luck using --diarize ever since the v3 update.

File "D:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\Python\Python310\Scripts\whisperx.exe_main
.py", line 7, in
File "D:\Python\Python310\lib\site-packages\whisperx\transcribe.py", line 192, in cli
diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
File "D:\Python\Python310\lib\site-packages\whisperx\diarize.py", line 16, in init
self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
File "D:\Python\Python310\lib\site-packages\pyannote\pipeline\pipeline.py", line 100, in getattr
raise AttributeError(msg)
AttributeError: 'SpeakerDiarization' object has no attribute 'to'

I am having the same problem with English as well.

commented

Hi @seset

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.

You fixed it? We are facing the same issue but haven't been able to fix it...

You can also just write your own script for merging word-level timestamps into sentence-level timestamps; if you want I can provide my script.
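
A minimal sketch of that kind of merge (not this commenter's script; it assumes each word entry is a dict with "word", "start" and "end" keys, as in whisperX's aligned output):

def words_to_sentences(words, terminators=".?!…。？！"):
    # Group word-level timestamps into sentence-level segments, closing a
    # sentence whenever a word ends with sentence-final punctuation.
    sentences, current = [], None
    for w in words:
        if current is None:
            current = {"text": w["word"], "start": w.get("start"), "end": w.get("end")}
        else:
            current["text"] += " " + w["word"]
            current["end"] = w.get("end", current["end"])
            if current["start"] is None:
                current["start"] = w.get("start")
        if w["word"].strip().endswith(tuple(terminators)):
            sentences.append(current)
            current = None
    if current is not None:
        sentences.append(current)
    return sentences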

Add the following

--max_line_width 42 --max_line_count 2

How do I use this in code?

audio = whisperx.load_audio(audio_file)
print("end load audio")
result = model.transcribe(audio, batch_size=batch_size, max_line_width=42, max_line_count=2)
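
For what it's worth, --max_line_width and --max_line_count are handled by the subtitle writer on the CLI side, not by model.transcribe(). A hedged sketch of how they could be applied in code, assuming whisperX keeps openai-whisper's get_writer interface (which its transcribe.py CLI uses):

from whisperx.utils import get_writer  # assumption: same helper as in openai-whisper

result = model.transcribe(audio, batch_size=batch_size)
writer = get_writer("srt", ".")  # output format, output directory
writer(result, audio_file, {"max_line_width": 42, "max_line_count": 2, "highlight_words": False})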

Is there any solution to this problem?
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
I got a failure here:
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)

Thanks!

commented

@Omer-ler are you using the correct pyannote and whisperX versions? Try it in a clean environment maybe.

I've been tearing my hair out trying to figure out why this is happening. I'm getting the no attribute "to" error as well.

I've tried just using pyannote, I've tried with whisperx, I've tried in clean environments and I've tried without.

commented

How do you install whisperx? Using a clean environment and running
pip install git+https://github.com/m-bain/whisperx.git
should work.

I'm still getting long segments in the latest, consisting of several sentences per segment rather than one sentence per segment. Is this expected? I've seen this with both English and French transcriptions.

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

@sw5813 Hmmm, if there is a full stop then this isn't expected (the sentence tokenization should split up multiple sentences). Can you print some examples?

Overall there is a trade-off here: batched inference provides a big speedup but loses Whisper's shorter segment timestamps.

For the next big update I will try to add functionality to support both ASR backends:

  1. native whisper (unbatched) with original timestamps
  2. batched inference with faster-whisper, without original timestamps (can cause 30s-long segments, especially for non-English)

Since for some people shorter segments matter more than speed, option 1 might be the better ASR backend for them.

Sure, here's one of the outputs I got as a result of the transcription (before the alignment step):

{'segments': [
{'text': " qui permettent d'avoir un échantillon beaucoup plus vaste de patientes et d'être éventuellement plus représentatif de ce qu'on va pouvoir avoir finalement dans la vraie vie et avec les patientes qu'on va traiter. Donc c'est des choses qui sont parfaitement, qui peuvent se combiner, c'est deux types d'études totalement différentes.", 'start': 0.008, 'end': 15.483},
{'text': " mais qui vont donner aussi des informations différentes. Donc les deux avec leurs avantages, leurs inconvénients. Donc les résultats pour one hundred and fifty three thousand six hundred femmes qui ont réalisé two hundred and forty five thousand five hundred and thirty four ovarian stimulation. Donc c'est vraiment représentatif de l'AMP française. Donc entre le premier janvier deux mille treize et le trente et un décembre deux mille dix-huit. Et l'âge moyen de ces femmes était de trente-quatre virgule zéro sept ans. Donc ce qui est tout à fait en rapport avec les pratiques.", 'start': 15.483, 'end': 45.47},
{'text': " Le Système National des Données de Santé, ce fameux SNDS, est constitué des données de l'assurance-maladie, en fait, et exhaustif puisqu'on couvre, pardon, de la population au niveau de la France. Donc c'est quelque chose.", 'start': 45.47, 'end': 59.948}
], 'language': 'fr'}

FWIW I used the "suppress_numerals" setting, which is why the numbers are written out, although I wonder if that may also be why some English made its way into this French transcription...

commented

After a few months of waiting, WhisperX is still the fastest and best!

My temporary solution for the verbose segment issue is below:

step 1: install whisperx in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

step 2: fix the segment duration problem
Edit the line below in "asr.py", changing "30" to "8". I tried values of 5-10 seconds, and the length of the subs was all acceptable.
I'd suggest @m-bain add an argument for this...

vad_segments = merge_chunks(vad_segments, 30)

https://github.com/m-bain/whisperX/blame/1b092de19a1878a8f138f665b1467ca21b076e7e/whisperx/asr.py#L263
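
With that change, the edited line simply becomes (8 is just the example value from above; anything around 5-10 seconds was reported as acceptable):

vad_segments = merge_chunks(vad_segments, 8)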

step 3: use "--no_align" to fix the extra empty spaces when transcribing zh, ja or other such languages, or edit "transcribe.py" to set it as the default, because I don't see that many broken lines when not using alignment; totally acceptable.

commented

Hi @seset

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.

You fixed it? We are facing the same issue but haven't been able to fix it...

Refer to my comment above...

commented

After a few months of waiting, WhisperX is still the fastest and best!

My temporary solution for the verbose segment issue is below:

step 1: install whisperx in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

step 2: fix the segment duration problem. Edit the line below in "asr.py", changing "30" to "8". I tried values of 5-10 seconds, and the length of the subs was all acceptable. I'd suggest @m-bain add an argument for this...

vad_segments = merge_chunks(vad_segments, 30)

https://github.com/m-bain/whisperX/blame/1b092de19a1878a8f138f665b1467ca21b076e7e/whisperx/asr.py#L263

step 3: use "--no_align" to fix the extra empty spaces when transcribing zh, ja or other such languages, or edit "transcribe.py" to set it as the default, because I don't see that many broken lines when not using alignment; totally acceptable.

@seset The --chunk_size argument was added in #445. Please check whether this resolves the issue.
The ja/zh extra-space issue is also resolved by #248.
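
For example, the new flag can be passed on the command line like this (the input filename and the value 8, taken from the workaround above, are just examples):

whisperx input.mp4 --chunk_size 8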