jacobped / tv-subtitle-ocr

Subtitle Extraction on TV broadcasts using OCR processing

Subtitle Extraction

This project was created in collaboration with the Royal Danish Library. It extracts subtitles from TV broadcast video material stored in Mediestream.

About

The program can be run in several different ways. In one mode, it fetches a video file UUID from an index (Apache Solr) and extracts frames from that video.

These frames are then pre-processed by merging and manipulating them, which gives a better result when finding contours in the frame. The estimated text contours are analysed for presumed subtitles. Once a subtitle has been located, the image is adjusted and cropped to the best-fitting size.

This gives the OCR process the best possible chance of reading the subtitles correctly.

The OCR process produces a text result, which needs to be post-processed to remove noise and stray characters.

When the OCR process has finished reading all the subtitles and the text has been post-processed, an SRT file is generated. This SRT file is sent to DOMS, ready to be used or processed further.

For further information about the workflow in the individual modules, see the README in each module.
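As an illustration of this kind of post-processing, the following is a minimal sketch and not the project's actual rules. It assumes that "noise" means characters other than letters, digits, whitespace and common punctuation; the class name `OcrCleanupSketch` and the character whitelist are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the real post-processing rules live in the project's modules.
final class OcrCleanupSketch {
    // Assumption: anything that is not a letter, digit, whitespace or common
    // punctuation is treated as OCR noise.
    private static final Pattern NOISE =
            Pattern.compile("[^\\p{L}\\p{N}\\s.,!?:;'\"()-]");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    /** Removes noise characters, then collapses whitespace runs to single spaces. */
    static String clean(String raw) {
        String withoutNoise = NOISE.matcher(raw).replaceAll("");
        return SPACES.matcher(withoutNoise).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("|| Hej   verden! ~~"));
    }
}
```

Note that this sketch also folds line breaks into spaces, so a two-line subtitle would come out as a single line.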
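For readers unfamiliar with the SRT format, the sketch below shows how a single cue could be written: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, and a blank line. The class and method names are hypothetical and are not taken from the project's code.

```java
import java.util.Locale;

// Hypothetical sketch of SRT cue formatting; not the project's own generator.
final class SrtSketch {
    /** Formats a position in milliseconds as an SRT timestamp (HH:MM:SS,mmm). */
    static String timestamp(long millis) {
        long hours = millis / 3_600_000;
        long minutes = (millis / 60_000) % 60;
        long seconds = (millis / 1_000) % 60;
        long rest = millis % 1_000;
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d", hours, minutes, seconds, rest);
    }

    /** Builds one SRT cue, terminated by the blank line that separates cues. */
    static String cue(int index, long startMs, long endMs, String text) {
        return index + "\n"
                + timestamp(startMs) + " --> " + timestamp(endMs) + "\n"
                + text + "\n\n";
    }

    public static void main(String[] args) {
        System.out.print(cue(1, 1_000, 3_500, "Godaften og velkommen."));
    }
}
```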

Module overview

Getting started

Prerequisites

What you need to install for this program to work, and how to install it.

Dependencies

| Dependency | Tested with version |
| --- | --- |
| Java 8 | 1.8.0_161-b12 |
| Tesseract 4 | 4.00.00alpha |
| FFmpeg 3.4 | 3.4.2 |

Installing dependencies

Step-by-step installation guide for the project's dependencies.

Tesseract 4

Tesseract can be installed by following these guides:

Tesseract Installing

Tesseract Compiling

Setting up Tesseract

Tesseract uses a few files and settings that need to be set up:

TESSDATA

The tessdata location needs to be set with the environment variable "TESSDATA_PREFIX" (optional: it can be overridden through the config file).
Example for Linux, run in a terminal or add to .bashrc:

export TESSDATA_PREFIX=/PATH/TO/TESSDATA/

Within the folder there must be a language file matching the language name defined in this application's config file. The data files will probably have a name formatted as follows: LANGUAGE.traineddata, where LANGUAGE is the alpha-3 code for a language.
Example for Danish in the tessdata directory: dan.traineddata
Example for Danish in config.properties: dan
Pre-trained language data can be found on Tesseract's official GitHub.
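The naming rule above can be checked before starting a run. The following is a minimal sketch, assuming the tessdata directory comes from TESSDATA_PREFIX and the language is `dan`; the class `TessdataCheckSketch` and its method are hypothetical and not part of the project.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: verifies that LANGUAGE.traineddata exists in the tessdata directory.
final class TessdataCheckSketch {
    /** Returns the expected path of the trained data file for the given language code. */
    static Path trainedDataFile(String tessdataDir, String language) {
        return Paths.get(tessdataDir).resolve(language + ".traineddata");
    }

    public static void main(String[] args) {
        String dir = System.getenv("TESSDATA_PREFIX");
        if (dir == null) {
            System.err.println("TESSDATA_PREFIX is not set");
            return;
        }
        Path file = trainedDataFile(dir, "dan");
        System.out.println(file + (Files.isRegularFile(file) ? " found" : " is missing"));
    }
}
```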

LD_LIBRARY_PATH

In some instances, LD_LIBRARY_PATH might not have been defined in the system.

Depending on the system, it can be set as follows:

Create a config file at the path "/etc/ld.so.conf.d/.conf".

The content of the file should be "/usr/local/lib".

Then save the file and run the following command for the change to take effect:

sudo ldconfig

FFmpeg

FFmpeg also needs to be installed for the program to work. Most Linux distributions have it preinstalled. FFmpeg can be acquired from the following link:

FFmpeg

Most of the packages contain both ffmpeg and ffprobe. The program needs both of them to be installed in order to run.

FFmpeg standard path:
"/usr/bin/ffmpeg"

FFprobe standard path:
"/usr/bin/ffprobe"

Both of these paths need to be set in the program's config file.

Running the program

There are a number of different ways to run the program. First, however, it needs to be packaged into the final executable program by running mvn package in the root module. This allows the program to be executed outside a development environment.

This produces the file "main/target/main-xxxx-package.tar.gz", which includes a production config in its config directory. The settings in that config need to be defined.
When the packaging is done, the program can be executed. There are multiple ways to execute the program inside the main/target/main-XXX/bin directory.

| Shell file | Purpose | Arguments |
| --- | --- | --- |
| runProcessFileListToDoms.sh | Takes a list of UUIDs and processes them to SRT files on Doms | -in: Input file with UUIDs; -f: Force override of SRT, if a subtitle is already set |
| runProcessFileListToFiles.sh | Takes a list of UUIDs and processes them to SRT files on disk | -in: Input file with UUIDs; -f: Force override of SRT, if a subtitle is already set |
| runProcessIndexToDoms.sh | Processes videos from the Solr index, 50 videos at a time. If no argument is set, processes all videos. Output is put on Doms | -m: Optional. Maximum number of files to process; runs until this is met |
| runProcessIndexToFileList.sh | Fetches videos with the SRT flag set. If no max is set, fetches all videos | -out: Output file to which the fetched list is written; -m: Maximum number of UUIDs to fetch |
| runProcessIndexToFiles.sh | Processes videos from the Solr index, 50 videos at a time. If no argument is set, processes all videos. Output is put on disk | -m: Optional. Maximum number of files to process; runs until this is met |

Developing the program

Once the dependencies are installed, the program needs to be configured before it can be worked on and tested. This is done in test-resources/src/main/resources/config.properties, which should be created based on the config.properties.SAMPLE file.

The same applies to main/src/main/config/config.properties, which should also be created based on the config.properties.SAMPLE file.

If there is a need to experiment with and test the program, remember to copy the config.properties file to test-resources/src/main/resources/config.properties.

config.properties explanation

As of 2018-03-23, config.properties looks like this:

| Property | Function | Required |
| --- | --- | --- |
| sharedWorkDir | Path to directory where the program can put temporary and output files | yes |
| videoSourceDir | Input directory for videos with UUIDs (as on Doms) | yes |
| workerThreads=2 | How many threads the program should run with | yes |
| debug=false | Whether the program should run in debug mode | yes |
| # FrameExtraction | | |
| ffmpegPath=/usr/bin/ffmpeg | Path to ffmpeg | yes |
| ffprobePath=/usr/bin/ffprobe | Path to ffprobe | yes |
| # OCRProcessorFactory | | |
| tesseractDataFolderPath | Path to tessdata directory. Not required if TESSDATA_PREFIX is set | no |
| tesseractPageSegmentation=6 | See Tesseract command-line usage | yes |
| tesseractOCREngineMode=3 | See Tesseract command-line usage | yes |
| tesseractTrainedDataLanguage=dan | Trained language data | yes |
| Solr - Index | | |
| indexServerUrl= | Solr URL for indexing | yes |
| Fedora Repository - Doms | | |
| domsServerAddress | URL address for Doms Repository | yes |
| domsUserName | Username for Doms repository | yes |
| domsPassword | Password for Doms repository | yes |
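A quick way to catch a missing setting is to load the file with `java.util.Properties` and look for required keys that are absent or blank. The sketch below is an assumption about how such a check could look, not the project's own loader; the class `ConfigCheckSketch` is hypothetical, and the key list is copied from the properties marked as required above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: reports required config.properties keys that are missing or blank.
final class ConfigCheckSketch {
    // Keys marked "yes" in the Required column of the table above.
    static final String[] REQUIRED = {
        "sharedWorkDir", "videoSourceDir", "workerThreads", "debug",
        "ffmpegPath", "ffprobePath",
        "tesseractPageSegmentation", "tesseractOCREngineMode", "tesseractTrainedDataLanguage",
        "indexServerUrl", "domsServerAddress", "domsUserName", "domsPassword"
    };

    /** Parses properties text; a real run would read the config.properties file instead. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(text)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the required keys that are absent or have an empty value. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Properties props = parse("workerThreads=2\ndebug=false\nindexServerUrl=\n");
        System.out.println("Missing or empty: " + missingKeys(props));
    }
}
```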

VisualVM

When using VisualVM to debug the application, remember to add the following parameter to the Java VM options when launching this program:

-Xverify:none

Authors

Programmers:

  • Andreas Reng Mogensen
  • Jacob Pedersen
  • Silas Jeppe Christensen

Supervisors associated with the Royal Danish Library - Aarhus department:

  • Kim Teglgaard Christensen
  • Kåre Fiedler Christiansen
  • Asger Askov Blekinge

Last updated: 2018-03-26T10:19

License: MIT License

Languages: Java 99.7%, Shell 0.3%