m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)

V3 sentence segment issue

seset opened this issue · comments

commented

V3 is now incredibly fast, maybe dozens of times faster,
but the subtitles for each paragraph are now too long. Examples below:
1
00:00:00,730 --> 00:00:26,190
Are you nervous? It's a good nervous, a happy nervous. Yeah, Matt's an incredible man, and it's obvious he's very much in love with you. I know. He's a little bit nervous too, you know, but he's holding up great. I'm glad. Yeah. Okay, well, I'm gonna go downstairs and work on those decorations. If you need anything, let me know. Love you, Mom. Love you, sweetheart. Oh, thank you.

2
00:00:34,973 --> 00:01:02,883
Oh, Jesus Christ. I thought you'd never leave. You need to get out of here, Ben. Like, right now. She could have caught us. But she didn't. And we weren't doing anything anyway. And who cares even if we were? I care? Okay, I'm not falling for one of your I wanna be with you routines. You only want me because you can't have me. Yeah, and you're only resisting because you're too preoccupied with being a petty rule follower. I am not.

Add the following

--max_line_width 42 --max_line_count 2
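
For example, a full command using these options might look like this (the input filename here is just a placeholder):

whisperx input.mp4 --max_line_width 42 --max_line_count 2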

commented

Add the following

--max_line_width 42 --max_line_count 2

Thanks, it does improve things somehow... but when I try a Japanese or Chinese source file, there are a lot of repetitive words/letters/characters in the srt and vtt output, while the json and txt content are correct. A Chinese example mixed with English is below to make it easier to understand what went wrong. Since the JSON file is correct, hopefully this minor error can be fixed soon :)

correct content (json and txt): -----------------------------------------------------

[JSON]
{
  "segments": [
    {
      "start": 0.6180930656934307,
      "end": 22.254860401459855,
      "text": "如何巧妙的去使用多个不同的ControlNet并用到Stable Diffusion的一些最新插件去精准的控制画面 生成一些这样的图片我研究了一整个星期 尝试了MultiControlNet不同的排列组合调整了很多不同的参数 把坑都踩过了一遍给你们整理了12种高阶用法的合计和所需要的所有参数",
      "words": [...]
    }
  ]
}

wrong and repetitive (srt,vtt):------------------------------------------------------
1
00:00:00,618 --> 00:00:22,235
如何 何巧 巧妙 妙的 的去 去使 使用 用多 多个 个不 不同 同的 的C Co on nt tr ro ol lN Ne et t并 并用 用到 到S St ta ab bl le e Di if ff fu us si io on n的 的一 一些 些最 最新 新插 插件 件去 去精 精准 准的 的控 控制 制画 画面 面 生成 成一 一些 些这 这样 样的 的图 图片 片我 我研 研究 究了 了一 一整 整个 个星 星期 期 尝试 试了 了M Mu ul lt ti iC Co on nt tr ro ol lN Ne et t不 不同 同的 的排 排列 列组 组合 合调 调整 整了 了很 很多 多不 不同 同的 的参 参数 数 把坑 坑都 都踩 踩过 过了 了一 一遍 遍给 给你 你们 们整 整理 理了 了12 2种高 高阶 阶用 用法 法的 的合 合计 计和 和所 所需 需要 要的 的所 所有 有参 参数 数

Add the following

--max_line_width 42 --max_line_count 2

this works for me, thanks 🙏

commented

Although these options help somewhat, I'd definitely say v3 produces worse subtitle formatting than the older version: it often breaks single words off sentences, because there's no obvious way to find the perfect max line width, and it joins unrelated chunks, which just worked before. Do you see any way to improve this? Here's an example:

Old version:

~~ Transcribing VAD chunk: (00:05.324 --> 00:34.045) ~~
[00:00.000 --> 00:01.840] Ik heb jullie zeer. Spreek voor jezelf, hè.
[00:01.920 --> 00:03.880] Attention, please. This is Lancelot.
[00:03.960 --> 00:06.160] Clap, clap. Switch seats.
[00:06.240 --> 00:07.920] Keep quets. Perfect.
[00:09.800 --> 00:12.680] Zoek ik al een boek over België. Ik ben de weg kwijt.
[00:17.080 --> 00:19.480] Ik ben het de man.
[00:19.560 --> 00:21.080] Je ziet het rare, jongen.
[00:23.360 --> 00:24.480] Hallo. Hallo.
[00:24.560 --> 00:25.840] Jullie zijn met een bus aan het rijden.
[00:25.920 --> 00:26.640] Kijk, ja.
[00:26.720 --> 00:28.120] Op die bus in een boom.
[00:28.200 --> 00:29.200] Wat?

New version:

0:00:05.40,0:00:06.82: Ik heb jullie zeer. Spreek voor jezelf,
0:00:06.84,0:00:09.28: hè. Attention, please. This is Lancelot.
0:00:09.34,0:00:12.66: Clap, clap. Switch seats. Keep quets.
0:00:12.70,0:00:16.73: Perfect. Zoek ik al een boek over België.
0:00:16.81,0:00:22.51: Ik ben de weg kwijt.
0:00:22.55,0:00:25.71: Ik ben het de man. Je ziet het raar,
0:00:25.75,0:00:29.36: jongen. Hallo.
0:00:29.40,0:00:31.30: Hallo. Jullie zijn met een bus aan het
0:00:31.44,0:00:34.00: rijden? Ja. Op die bus in een boom? Wat?

Yes, thanks for reporting, I've found similar. Unfortunately the natural segments from whisper cannot be extracted with the current batched method.

Definitely the logic for post-processing the 30s chunks into segments needs to be improved. I would suggest the following:

  1. Use the nltk toolkit's nltk.sent_tokenize to tokenize the text into sentences (create a segment for each sentence), roughly as in the sketch below.
  2. For long sentences, these can be further broken at comma locations.
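
A minimal sketch of that sentence-level splitting (not the actual whisperX code; the segment dict layout is assumed from the transcription output shown elsewhere in this thread, and timestamps are spread proportionally to sentence length since word timings are ignored here):

import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer models

def split_segment_into_sentences(segment):
    # segment is assumed to be {"text": str, "start": float, "end": float}
    sentences = nltk.sent_tokenize(segment["text"])
    total_chars = sum(len(s) for s in sentences) or 1
    duration = segment["end"] - segment["start"]
    out, cursor = [], segment["start"]
    for s in sentences:
        span = duration * len(s) / total_chars  # crude proportional split
        out.append({"text": s.strip(), "start": cursor, "end": cursor + span})
        cursor += span
    return out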

Unfortunately my spare time is going into improving diarization right now, but feel free to send a pull request with improvements to this logic.

commented

Anyone working on a better segmentation right now? Otherwise I'd take a look at it.

So actually I'm on it at the moment, as I need sentence segments for diarization (the alignment logic also needed cleaning up); will push in an hour or so.

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

commented

Thanks a lot @m-bain, it's much improved. No broken sentences, though I do see the short segments appearing. That could probably be fixed by combining those segments as a post-processing step (see the sketch further below); it doesn't necessarily need to be in WhisperX.

Here are the results of the same file on the new version:

00:00:05,404 --> 00:00:06,165 Ik heb jullie zeer.
00:00:06,165 --> 00:00:06,985 Spreek voor jezelf, hè.
00:00:06,985 --> 00:00:08,246 Attention, please.
00:00:08,246 --> 00:00:09,347 This is Lancelot.
00:00:09,347 --> 00:00:10,228 Clap, clap.
00:00:10,228 --> 00:00:11,689 Switch seats.
00:00:11,689 --> 00:00:12,709 Keep quets.
00:00:12,709 --> 00:00:15,291 Perfect.
00:00:15,291 --> 00:00:16,812 Zoek ik al een boek over België.
00:00:16,812 --> 00:00:22,557 Ik ben de weg kwijt.
00:00:22,557 --> 00:00:24,978 Ik ben het de man.
00:00:24,978 --> 00:00:26,139 Je ziet het raar, jongen.
00:00:26,139 --> 00:00:29,402 Hallo.
00:00:29,402 --> 00:00:29,782 Hallo.
00:00:29,782 --> 00:00:31,803 Jullie zijn met een bus aan het rijden?
00:00:31,803 --> 00:00:32,064 Ja.
00:00:32,064 --> 00:00:33,685 Op die bus in een boom?
00:00:33,685 --> 00:00:34,005 Wat?

The biggest issue I see now is that each subtitle's end time appears to be set to the start of the next one, even when this isn't accurate.

e.g. The "Perfect" line above ended at 13.244s on the old version, which is accurate, while on v3 it stays on-screen over two seconds longer, until 15.291s.
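
A minimal sketch of the short-segment merging mentioned above (not part of whisperX; the segment dict layout is assumed from the transcription output in this thread):

def merge_short_segments(segments, min_duration=1.5, max_gap=0.5):
    # Greedily fold a segment into the previous one when it is shorter than
    # min_duration seconds and starts within max_gap seconds of it.
    merged = []
    for seg in segments:
        if (merged
                and seg["end"] - seg["start"] < min_duration
                and seg["start"] - merged[-1]["end"] <= max_gap):
            merged[-1]["text"] = merged[-1]["text"].rstrip() + " " + seg["text"].lstrip()
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged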

I still have the same problem even in the newest version (tried on ja and zh). Everything comes out in huge chunks.
Python 3.10.11
whisperx --model medium --language ja --compute_type int8 filename.ext

The process goes from
Performing transcription...
right into
Performing alignment...
without displaying any timestamps.

commented

Same here. V3.1. Sentences are too long...

commented

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

Thanks for the massive code improvement! After updating, I found the following:

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.
  2. For fr, ja and zh, there is still not much improvement in segment length. Especially for ja and zh, with these kinds of syllabary/ideographic characters, I noticed the updated code for eliminating repetitive characters, but the extra space in between makes the ja and zh segments display even longer, so I use --no_align for ja and zh for now.
  3. When I try to use diarization to see whether it helps the segments, I get the errors below; I haven't had much luck using --diarize ever since the v3 update.

File "D:\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\Python\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\Python\Python310\Scripts\whisperx.exe_main
.py", line 7, in
File "D:\Python\Python310\lib\site-packages\whisperx\transcribe.py", line 192, in cli
diarize_model = DiarizationPipeline(use_auth_token=hf_token, device=device)
File "D:\Python\Python310\lib\site-packages\whisperx\diarize.py", line 16, in init
self.model = Pipeline.from_pretrained(model_name, use_auth_token=use_auth_token).to(device)
File "D:\Python\Python310\lib\site-packages\pyannote\pipeline\pipeline.py", line 100, in getattr
raise AttributeError(msg)
AttributeError: 'SpeakerDiarization' object has no attribute 'to'

I am having the same problem with English as well.

commented

Hi @seset

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.

You fixed it? We are facing the same issue but haven't been able to fix it...

You can also just write your own script for merging word-level timestamps into sentence-level timestamps; if you want I can provide my script.
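
A minimal sketch of that kind of merge (not this commenter's script; it assumes each word entry is a dict with "word", "start" and "end" keys, as in whisperX's aligned output):

def words_to_sentences(words, terminators=".?!…。？！"):
    # Group word-level timestamps into sentence-level segments, closing a
    # sentence whenever a word ends with sentence-final punctuation.
    sentences, current = [], None
    for w in words:
        if current is None:
            current = {"text": w["word"], "start": w.get("start"), "end": w.get("end")}
        else:
            current["text"] += " " + w["word"]
            current["end"] = w.get("end", current["end"])
            if current["start"] is None:
                current["start"] = w.get("start")
        if w["word"].strip().endswith(tuple(terminators)):
            sentences.append(current)
            current = None
    if current is not None:
        sentences.append(current)
    return sentences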

Add the following

--max_line_width 42 --max_line_count 2

How do I use this in code?

audio = whisperx.load_audio(audio_file)
print("end load audio")
result = model.transcribe(audio, batch_size=batch_size, max_line_width=42, max_line_count=2)
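
For what it's worth, --max_line_width and --max_line_count are handled by the subtitle writer on the CLI side, not by model.transcribe(). A hedged sketch of how they could be applied in code, assuming whisperX keeps openai-whisper's get_writer interface (which its transcribe.py CLI uses):

from whisperx.utils import get_writer  # assumption: same helper as in openai-whisper

result = model.transcribe(audio, batch_size=batch_size)
writer = get_writer("srt", ".")  # output format, output directory
writer(result, audio_file, {"max_line_width": 42, "max_line_count": 2, "highlight_words": False})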

Is there any solution to this problem?
AttributeError: 'SpeakerDiarization' object has no attribute 'to'
I got a failure here:
diarize_model = DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)

Thanks!

commented

@Omer-ler are you using the correct pyannote and whisperX versions? Try it in a clean environment maybe.

I've been tearing my hair out trying to figure out why this is happening. I'm getting the no attribute "to" error as well.

I've tried just using pyannote, I've tried with whisperx, I've tried in clean environments and I've tried without.

commented

How do you install whisperx? Using a clean environment and running
pip install git+https://github.com/m-bain/whisperx.git
should work.

I'm still getting long segments in the latest, consisting of several sentences per segment rather than one sentence per segment. Is this expected? I've seen this with both English and French transcriptions.

Should be hopefully fixed here 24008aa

Sometimes nltk.sent_tokenize can create segments that are too short, but I found it works well. It also improves the diarization.

@sw5813 Hmmm, if there is a full stop then this isn't expected (the sentence tokenization should split up multiple sentences). Can you print some examples?

Overall there is a trade-off here: batched inference provides a big speedup but loses Whisper's shorter segment timestamps.

For the next big update I will try to add functionality to support both ASR backends:

  1. native whisper (unbatched) with original timestamps
  2. batched inference with faster-whisper, without original timestamps (can cause 30s-long segments, especially for non-English)

Since for some people shorter segments matter more than speed, option 1 might be the better ASR backend for them.

Sure, here's one of the outputs I got as a result of the transcription (before the alignment step):

{'segments': [
{'text': " qui permettent d'avoir un échantillon beaucoup plus vaste de patientes et d'être éventuellement plus représentatif de ce qu'on va pouvoir avoir finalement dans la vraie vie et avec les patientes qu'on va traiter. Donc c'est des choses qui sont parfaitement, qui peuvent se combiner, c'est deux types d'études totalement différentes.", 'start': 0.008, 'end': 15.483},
{'text': " mais qui vont donner aussi des informations différentes. Donc les deux avec leurs avantages, leurs inconvénients. Donc les résultats pour one hundred and fifty three thousand six hundred femmes qui ont réalisé two hundred and forty five thousand five hundred and thirty four ovarian stimulation. Donc c'est vraiment représentatif de l'AMP française. Donc entre le premier janvier deux mille treize et le trente et un décembre deux mille dix-huit. Et l'âge moyen de ces femmes était de trente-quatre virgule zéro sept ans. Donc ce qui est tout à fait en rapport avec les pratiques.", 'start': 15.483, 'end': 45.47},
{'text': " Le Système National des Données de Santé, ce fameux SNDS, est constitué des données de l'assurance-maladie, en fait, et exhaustif puisqu'on couvre, pardon, de la population au niveau de la France. Donc c'est quelque chose.", 'start': 45.47, 'end': 59.948}
], 'language': 'fr'}

FWIW I used the "suppress_numerals" setting, which is why the numbers are written out, although I wonder if that may also be why some English made its way into this French transcription...

commented

After a few months of waiting, WhisperX is still the fastest and best!

My temporary solution for the verbose segment issue is below:

step 1: install whisperx in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

step 2: fix the segment duration problem
Edit the line below in "asr.py", changing "30" to "8". I tried values of 5-10 seconds, and the length of the subs was all acceptable.
I'd suggest @m-bain add an argument for this...

vad_segments = merge_chunks(vad_segments, 30)

https://github.com/m-bain/whisperX/blame/1b092de19a1878a8f138f665b1467ca21b076e7e/whisperx/asr.py#L263
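
With that change, the edited line simply becomes (8 is just the example value from above; anything around 5-10 seconds was reported as acceptable):

vad_segments = merge_chunks(vad_segments, 8)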

step 3: use "--no_align" to fix the extra empty spaces when transcribing zh, ja or other such languages, or edit "transcribe.py" to set it as the default, because I don't see that many broken lines when not using alignment; totally acceptable.

commented

Hi @seset

  1. For English and German, the segments are greatly improved, almost like the natural segments from original whisper.

You fixed it? We are facing the same issue but haven't been able to fix it...

Refer to my comment above...

commented

After a few months of waiting, WhisperX is still the fastest and best!

My temporary solution for the verbose segment issue is below:

step 1: install whisperx in editable mode:

$ git clone https://github.com/m-bain/whisperX.git
$ cd whisperX
$ pip install -e .

step 2: fix the segment duration problem. Edit the line below in "asr.py", changing "30" to "8". I tried values of 5-10 seconds, and the length of the subs was all acceptable. I'd suggest @m-bain add an argument for this...

vad_segments = merge_chunks(vad_segments, 30)

https://github.com/m-bain/whisperX/blame/1b092de19a1878a8f138f665b1467ca21b076e7e/whisperx/asr.py#L263

step 3: use "--no_align" to fix the extra empty spaces when transcribing zh, ja or other such languages, or edit "transcribe.py" to set it as the default, because I don't see that many broken lines when not using alignment; totally acceptable.

@seset The --chunk_size argument was added in #445. Please check whether this resolves the issue.
The ja/zh extra-space issue is also resolved by #248.
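
For example, the new flag can be passed on the command line like this (the input filename and the value 8, taken from the workaround above, are just examples):

whisperx input.mp4 --chunk_size 8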