Hard-burned subtitles OCR to SRT extractor

An Apple Silicon (M1/M2) toolchain for extracting .srt subtitles from movies with hard-burned (embedded) subtitles.
The OCR step uses a modified version of macOCR (forked from https://github.com/xulihang/macOCR); the macOS Apple Silicon (ARM64) binary is included in the repo as OCR.

The workflow sequence run by the do-all.sh script:

1. Generate a cropped video with ffmpeg (you'll have to adjust the crop area for your video size).
2. Generate PNG snapshots (using ffmpeg ... fps=1, i.e. 1 snapshot per second).
3. Run Optical Character Recognition with macOCR (Apple Silicon only), which outputs a JSON file.
4. Convert the JSON to SRT, then normalize and deduplicate using https://github.com/cdown/srt.
5. optional: Generate Chinese pinyin and traditional/simplified versions.
6. optional: Translate with DeepL.
7. optional: Merge the translation into the final SRT containing Hanzi Simplified + Hanzi Traditional + Pinyin + English.
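
The crop, snapshot, and JSON-to-SRT steps can be sketched in Python roughly as follows. This is a minimal illustration, not the code in do-all.sh: the file names (cropped.mp4, snap_%05d.png), the function names, and the input layout (one recognised string per snapshot, in order) are assumptions, and the real macOCR JSON may be structured differently. The repo uses the srt library for normalizing and deduplicating; the sketch uses only the standard library so that it stays self-contained.

```python
def ffmpeg_commands(video, crop_w, crop_h, crop_x, crop_y, fps=1):
    """Build argument lists for the crop step and the snapshot step.

    The output names below are hypothetical placeholders.
    """
    cropped = "cropped.mp4"
    crop_cmd = [
        "ffmpeg", "-i", video,
        # crop filter takes width:height:x:y of the subtitle area
        "-vf", f"crop={crop_w}:{crop_h}:{crop_x}:{crop_y}",
        "-an",  # drop audio, it is not needed for OCR
        cropped,
    ]
    snap_cmd = [
        "ffmpeg", "-i", cropped,
        "-vf", f"fps={fps}",  # fps=1 gives one PNG per second
        "snap_%05d.png",
    ]
    return crop_cmd, snap_cmd


def timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def frames_to_srt(frames, fps=1):
    """Merge consecutive snapshots with identical text into SRT cues.

    frames: list of OCR strings, one per snapshot (assumed layout);
    index 0 is treated as covering the first 1/fps seconds.
    """
    cues = []  # each cue is [start_frame, end_frame, text]
    for i, text in enumerate(frames):
        text = text.strip()
        if not text:
            continue  # no subtitle visible in this snapshot
        if cues and cues[-1][2] == text and cues[-1][1] == i:
            cues[-1][1] = i + 1  # same line still on screen: extend the cue
        else:
            cues.append([i, i + 1, text])

    blocks = []
    for n, (start, end, text) in enumerate(cues, start=1):
        times = f"{timestamp(start / fps)} --> {timestamp(end / fps)}"
        blocks.append(f"{n}\n{times}\n{text}\n")
    return "\n".join(blocks)
```

The command lists can be passed to subprocess.run(cmd, check=True) once the crop values have been adjusted for the video. Merging on exact string equality is deliberately simple; OCR output that flickers between near-identical readings of the same line would need a fuzzier comparison.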

NOTE

This collection of scripts is a work in progress and will require tweaking for each specific scenario (the places that need editing are marked with TODO comments in the code). Use at your own risk.


Apache License 2.0
