Stabilizing Timestamps for Whisper

Description

This script modifies OpenAI's Whisper and adds more robust decoding logic on top of it to produce more accurate segment-level timestamps and to obtain word-level timestamps without extra inference.

Setup

pip install -U stable-ts

To install the latest commit:

pip install -U git+https://github.com/jianfch/stable-ts.git

Command-line usage

Transcribe audio, then save the result as a JSON file.

stable-ts audio.mp3 -o audio.json

Process the JSON file of results into an ASS file.

stable-ts audio.json -o audio.ass

Transcribe multiple audio files then process the results directly into SRT files.

stable-ts audio1.mp3 audio2.mp3 audio3.mp3 -o audio1.srt audio2.srt audio3.srt
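
When there are many files, typing every output name by hand gets tedious. The sketch below builds the same command shape shown above (inputs, then `-o`, then one output per input) from a list of paths. The helper name `build_stable_ts_command` is hypothetical and not part of stable-ts; it only assumes the CLI form shown in this section.

```python
from pathlib import Path


def build_stable_ts_command(audio_paths, suffix=".srt"):
    """Return an argv list of the form: stable-ts <inputs...> -o <outputs...>.

    Hypothetical helper: each output name is the input name with its
    extension swapped for `suffix`.
    """
    inputs = [str(p) for p in audio_paths]
    outputs = [str(Path(p).with_suffix(suffix)) for p in audio_paths]
    return ["stable-ts", *inputs, "-o", *outputs]


# To actually run it (requires stable-ts to be installed):
#   import subprocess
#   subprocess.run(build_stable_ts_command(["audio1.mp3", "audio2.mp3"]), check=True)
```

Building the argument list separately from running it makes it easy to print and check the command before starting a long transcription.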

Show all available arguments and help.

stable-ts -h

Python usage

import stable_whisper

model = stable_whisper.load_model('base')
# modified model should run just like the regular model but accepts additional parameters
results = model.transcribe('audio.mp3')

Demo video: jfk_segment.mp4

# the above uses default settings on version 1.1 with large model
# sentence/phrase-level
stable_whisper.results_to_sentence_srt(results, 'audio.srt')

Demo video: jfk_word_segments.mp4

# the above uses default settings on version 1.1 with large model
# sentence/phrase-level & word-level
stable_whisper.results_to_sentence_word_ass(results, 'audio.ass')
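
To make the SRT output less opaque, here is a minimal sketch of how segment-level timestamps map onto SRT text. It is not the implementation of `results_to_sentence_srt`. It assumes `results['segments']` follows the regular Whisper shape (a list of dicts with `start`, `end` in seconds, and `text`), and the helper names and sample segments are made up for illustration.

```python
def format_srt_time(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render Whisper-style segments as SRT text (hypothetical helper)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


# Made-up segments, shaped like the assumed results['segments'].
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " And so, my fellow Americans"},
    {"start": 2.5, "end": 4.25, "text": " ask not"},
]
print(segments_to_srt(example_segments))
```

With a real transcription, `segments_to_srt(results['segments'])` would be the equivalent call, provided the assumed structure holds for your version.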

Additional Info

Although timestamps are chronological, they can still be very inaccurate depending on the model, audio, and parameters.
To produce production-ready word-level results, the model needs to be fine-tuned on a high-quality dataset of audio with word-level timestamps.
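
Because accuracy varies, it can help to flag obviously implausible timestamps before using them. The sketch below is one possible sanity check, not a feature of stable-ts. It assumes Whisper-style segments (dicts with `start` and `end` in seconds); the function name is hypothetical and the 30-second threshold is an arbitrary choice loosely based on Whisper's 30-second audio window.

```python
def find_suspect_segments(segments, max_duration=30.0):
    """Return (index, reason) pairs for segments with implausible timing."""
    suspects = []
    previous_end = 0.0
    for index, seg in enumerate(segments):
        start, end = seg["start"], seg["end"]
        if start < previous_end:
            suspects.append((index, "overlaps previous segment"))
        if end <= start:
            suspects.append((index, "zero or negative duration"))
        elif end - start > max_duration:
            suspects.append((index, "unusually long"))
        # Track the latest end time seen so far.
        previous_end = max(previous_end, end)
    return suspects
```

A check like this only catches structural problems. It cannot tell whether a timestamp that looks plausible actually lines up with the audio.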

License

This project is licensed under the MIT License; see the LICENSE file for details.

Acknowledgments

Includes a slight modification of the original work: Whisper
