ASR with reliable word-level timestamps using OpenAI's Whisper

Stabilizing Timestamps for Whisper

This script modifies OpenAI's Whisper to produce more reliable timestamps.

(demo video: a.mp4)

Setup

pip install -U stable-ts

To install the latest commit:

pip install -U git+https://github.com/jianfch/stable-ts.git

Usage

Transcribe

import stable_whisper
model = stable_whisper.load_model('base')
result = model.transcribe('audio.mp3')
result.to_srt_vtt('audio.srt')
CLI
stable-ts audio.mp3 -o audio.srt

Parameters: load_model(), transcribe(), transcribe_minimal()
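
transcribe_minimal() is listed above but not shown; a minimal sketch, assuming it takes the same audio argument as transcribe() and returns the same kind of result (reusing model from the snippet above):

# assumption: transcribe_minimal() mirrors transcribe()'s call pattern and
# returns the same kind of result with the usual output methods
result = model.transcribe_minimal('audio.mp3')
result.to_srt_vtt('audio.srt')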

faster-whisper

Use with faster-whisper:

model = stable_whisper.load_faster_whisper('base')
result = model.transcribe_stable('audio.mp3')

Parameters: transcribe_stable()
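
Assuming transcribe_stable() also returns the same kind of result object as transcribe(), the output methods from the next section apply unchanged:

# assumption: the faster-whisper result supports the same output methods
result.to_srt_vtt('audio.srt')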

Output

Stable-ts supports various text output formats.

result.to_srt_vtt('audio.srt') #SRT
result.to_srt_vtt('audio.vtt') #VTT
result.to_ass('audio.ass') #ASS
result.to_tsv('audio.tsv') #TSV

Parameters: to_srt_vtt(), to_ass(), to_tsv()

There are word-level and segment-level timestamps. All output formats support them, and all formats except TSV also support both levels simultaneously. By default, segment_level and word_level are both True for all the formats that support both simultaneously.
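
For example, the two levels can be toggled through the word_level and segment_level arguments of the output methods, shown here with to_srt_vtt() (assuming the other formats that support both levels accept the same keywords):

result.to_srt_vtt('audio.vtt', word_level=False)    # segment-level timestamps only
result.to_srt_vtt('audio.vtt', segment_level=False) # word-level timestamps only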

Examples in VTT.

Default: segment_level=True + word_level=True

CLI

--segment_level true + --word_level true

00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,

segment_level=True + word_level=False

00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,

segment_level=False + word_level=True

00:00:07.760 --> 00:00:07.860
But

00:00:07.860 --> 00:00:08.040
when

00:00:08.040 --> 00:00:08.280
you

00:00:08.280 --> 00:00:08.580
arrived

...

JSON

The result can also be saved as a JSON file to preserve all the data for future reprocessing. This is useful for testing different sets of postprocessing arguments without the need to redo inference.

result.save_as_json('audio.json')
CLI
stable-ts audio.mp3 -o audio.json

Processing JSON file of the results into SRT.

result = stable_whisper.WhisperResult('audio.json')
result.to_srt_vtt('audio.srt')
CLI
stable-ts audio.json -o audio.srt

Alignment

Audio can be aligned/synced with plain text at the word level.

text = 'Machines thinking, breeding. You were to bear us a new, promised land.'
result = model.align('audio.mp3', text)

When the text is correct but the timestamps need more work, align() is a faster alternative for testing various settings/models.

new_result = model.align('audio.mp3', result)

Parameters: align()
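
Assuming align() returns the same kind of result object as transcribe(), the aligned timestamps can be written out directly ('audio_aligned.srt' is just an example filename):

# assumption: the aligned result supports the usual output methods
result.to_srt_vtt('audio_aligned.srt')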

Adjustments

Timestamps are adjusted after the model predicts them. When suppress_silence=True (default), transcribe()/transcribe_minimal()/align() adjust based on silence/non-speech. The timestamps can be further adjusted based on another result with adjust_by_result(), which acts as a logical AND operation on the timestamps of both results, further reducing the duration of each word. Note: both results are required to have word timestamps and matching words.

# the adjustments are in-place for `result`
result.adjust_by_result(new_result)

Parameters: adjust_by_result()

Regrouping Words

Stable-ts has a preset for regrouping words into different segments with more natural boundaries. This preset is enabled by regroup=True (default), but there are other built-in regrouping methods that let you customize the regrouping algorithm; the preset is simply a predefined combination of those methods.

(demo video: xata.mp4)
# The following results are all functionally equivalent:
result0 = model.transcribe('audio.mp3', regroup=True) # regroup is True by default
result1 = model.transcribe('audio.mp3', regroup=False)
(
    result1
    .clamp_max()
    .split_by_punctuation([('.', ' '), '。', '?', '？', (',', ' '), '，'])
    .split_by_gap(.5)
    .merge_by_gap(.3, max_words=3)
    .split_by_punctuation([('.', ' '), '。', '?', '？'])
)
result2 = model.transcribe('audio.mp3', regroup='cm_sp=.* /。/?/？/,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？')

# To undo all regrouping operations:
result0.reset()

Any regrouping algorithm can be expressed as a string. Please feel free to share your strings here.

Regrouping Methods

Locating Words

You can locate words with regular expressions.

# Find every sentence that contains "and"
matches = result.find(r'[^.]+and[^.]+\.')
# print all matches if there are any
for match in matches:
  print(f'match: {match.text_match}\n'
        f'text: {match.text}\n'
        f'start: {match.start}\n'
        f'end: {match.end}\n')
  
# Find the word before and after "and" in the matches
matches = matches.find(r'\s\S+\sand\s\S+')
for match in matches:
  print(f'match: {match.text_match}\n'
        f'text: {match.text}\n'
        f'start: {match.start}\n'
        f'end: {match.end}\n')

Parameters: find()

Tips

  • do not disable word timestamps with word_timestamps=False if you need reliable segment timestamps
  • use vad=True for more accurate non-speech detection
  • use demucs=True to isolate vocals with Demucs; it is also effective at isolating vocals even if there is no music
  • use demucs=True and vad=True for music
  • use --dq true or dq=True with stable_whisper.load_model to enable dynamic quantization for inference on CPU (see the sketch after this list)
  • use encode_video_comparison() to encode multiple transcripts into one video for synced comparison; see Encode Comparison
  • use visualize_suppression() to visualize the differences between non-VAD and VAD options; see Visualizing Suppression
  • if the non-speech/silence seems to be detected but the starting timestamps do not reflect that, then try min_word_dur=0
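
A sketch combining several of the options from the tips above (whether they can all be combined in a single call is an assumption):

import stable_whisper

# dynamic quantization for CPU inference (dq=True, per the tip above)
model = stable_whisper.load_model('base', dq=True)
# VAD-based non-speech detection plus Demucs vocal isolation, e.g. for music
result = model.transcribe('audio.mp3', vad=True, demucs=True)
result.to_srt_vtt('audio.srt')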

Visualizing Suppression

You can visualize which parts of the audio will likely be suppressed (i.e. marked as silent). Requires: Pillow or opencv-python.

Without VAD

import stable_whisper
# regions on the waveform colored red are where it will likely be suppressed and marked as silent
# [q_levels]=20 and [k_size]=5 (default)
stable_whisper.visualize_suppression('audio.mp3', 'image.png', q_levels=20, k_size=5)

(output image: without VAD)

With VAD

# [vad_threshold]=0.35 (default)
stable_whisper.visualize_suppression('audio.mp3', 'image.png', vad=True, vad_threshold=0.35)

(output image: with VAD)

Parameters: visualize_suppression()

Encode Comparison

You can encode videos similar to the ones in the doc for comparing transcriptions of the same audio.

stable_whisper.encode_video_comparison(
    'audio.mp3', 
    ['audio_sub1.srt', 'audio_sub2.srt'], 
    output_videopath='audio.mp4', 
    labels=['Example 1', 'Example 2']
)

Parameters: encode_video_comparison()

Multiple Files with CLI

Transcribe multiple audio files, then process the results directly into SRT files.

stable-ts audio1.mp3 audio2.mp3 audio3.mp3 -o audio1.srt audio2.srt audio3.srt

Any ASR

You can use most of the features of Stable-ts to improve the results of any ASR model/API. Just follow this notebook.
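
As a rough sketch of the idea (the exact input layout WhisperResult accepts is an assumption here; the notebook is the authoritative reference), word timestamps from another ASR system could be wrapped in a WhisperResult so the regrouping and output features become available:

import stable_whisper

# assumption: WhisperResult accepts a dict shaped like the JSON it saves,
# i.e. segments containing word entries with 'word', 'start', and 'end'
other_asr_output = {
    'language': 'en',
    'segments': [
        {
            'start': 0.0,
            'end': 2.5,
            'text': 'Machines thinking, breeding.',
            'words': [
                {'word': 'Machines', 'start': 0.0, 'end': 0.7},
                {'word': ' thinking,', 'start': 0.7, 'end': 1.5},
                {'word': ' breeding.', 'start': 1.5, 'end': 2.5},
            ],
        },
    ],
}

result = stable_whisper.WhisperResult(other_asr_output)
result.to_srt_vtt('audio.srt')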

Quick 1.X → 2.X Guide

What's new in 2.0.0?

  • updated to use Whisper's more reliable word-level timestamps method.
  • the more reliable word timestamps allow regrouping all words into segments with more natural boundaries.
  • can now suppress silence with Silero VAD (requires PyTorch 1.12.0+)
  • non-VAD silence suppression is also more robust

Usage changes

  • results_to_sentence_srt(result, 'audio.srt') → result.to_srt_vtt('audio.srt', word_level=False) (see the sketch after this list)
  • results_to_word_srt(result, 'audio.srt') → result.to_srt_vtt('output.srt', segment_level=False)
  • results_to_sentence_word_ass(result, 'audio.srt') → result.to_ass('output.ass')
  • there's no need to stabilize segments after inference because they're already stabilized during inference
  • transcribe() returns a WhisperResult object, which can be converted to a dict with .to_dict(), e.g. result.to_dict()
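
The mappings above written out as 2.X calls (a sketch; result is assumed to come from model.transcribe() as in the earlier examples):

result.to_srt_vtt('audio.srt', word_level=False)      # was results_to_sentence_srt()
result.to_srt_vtt('output.srt', segment_level=False)  # was results_to_word_srt()
result.to_ass('output.ass')                           # was results_to_sentence_word_ass()
result_dict = result.to_dict()                        # WhisperResult -> plain dict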

License

This project is licensed under the MIT License - see the LICENSE file for details

Acknowledgments

Includes slight modification of the original work: Whisper
