soroush/tesseract_box_merge

Synopsis

box_merge is a utility for the Tesseract OCR engine's training process. It merges the boxes of Arabic-script characters according to their joining classes, as specified in the Unicode Standard.

Usage

Call box_merge with these arguments:

box_merge -i INPUT_BOX_FILE -o OUTPUT_FILE -u ARABIC_SHAPING

INPUT_BOX_FILE The input box file.

OUTPUT_FILE The output box file.

ARABIC_SHAPING This file is part of the Unicode Character Database (UCD). It describes the joining classes of the characters covered in Tables 9-3, 9-8, 9-9, 9-10, 9-14, 9-15, 9-16, 9-19, 9-20, 10-4, 10-5, 10-6, 10-7, and 19-5 of The Unicode Standard core specification. The file (usually named ArabicShaping.txt) is available in the UCD.
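To illustrate what box_merge needs from this file, here is a minimal Python sketch that reads the joining type of each code point. It is not the tool's own parser; the function name parse_arabic_shaping is hypothetical. It relies only on the published layout of ArabicShaping.txt: semicolon-separated fields (code point; schematic name; joining type; joining group) with `#` starting a comment.

```python
def parse_arabic_shaping(lines):
    """Map each code point to its joining type (R, L, D, C, U, or T).

    Hypothetical helper for illustration, not part of box_merge.
    """
    joining = {}
    for line in lines:
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 4:
            continue  # Not a data line in the expected four-field layout.
        joining[int(fields[0], 16)] = fields[2]
    return joining


# A few lines in the same layout as ArabicShaping.txt.
sample = [
    "# Unicode; Schematic Name; Joining Type; Joining Group",
    "0627; ALEF; R; ALEF",
    "0628; BEH; D; BEH",
    "0640; TATWEEL; C; No_Joining_Group",
]
table = parse_arabic_shaping(sample)
```

With the real file, the same function could be called as parse_arabic_shaping(open("ArabicShaping.txt", encoding="utf-8")).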

Motivation

Tesseract OCR is an excellent OCR engine that supports custom languages and fonts and ships with tools for generating trained data. Unfortunately, the Tesseract training tools fall short for complex scripts like Arabic. By its nature, Arabic is written from right to left and follows a joining mechanism. Ignoring these features leaves us with an insufficient toolset for training Tesseract OCR data for Persian, Arabic, Indic, etc.

This project provides a tool to enhance the .box files generated by text2image. The following changes are applied after running box_merge:

The order of box entries is reversed to meet the needs of RTL languages (basically, this allows correct merging);
Box definition entries in the input file are merged together following the joining classes of Unicode Standard Annex #44.
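The two changes above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm box_merge actually implements: it assumes each box entry has the standard Tesseract layout (symbol, left, bottom, right, top, page), that reversing the file order yields logical reading order, that two neighbours join when the first has joining type D, L, or C and the second has D, R, or C, and that unlisted characters are non-joining (U). The names Box, joins, and merge_boxes are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def joins(prev_char, next_char, joining):
    """Return True if prev_char connects forward to next_char (assumed rule)."""
    prev_type = joining.get(ord(prev_char), "U")
    next_type = joining.get(ord(next_char), "U")
    return prev_type in ("D", "L", "C") and next_type in ("D", "R", "C")


def merge_boxes(boxes, joining):
    """Reverse the entries, then merge runs of joined characters."""
    merged = []
    for box in reversed(boxes):
        last = merged[-1] if merged else None
        if (last is not None and last.page == box.page
                and joins(last.text[-1], box.text[0], joining)):
            # Grow the previous entry: concatenate text, take the union box.
            merged[-1] = Box(
                last.text + box.text,
                min(last.left, box.left),
                min(last.bottom, box.bottom),
                max(last.right, box.right),
                max(last.top, box.top),
                box.page,
            )
        else:
            merged.append(box)
    return merged


# Joining types for ALEF (right-joining) and BEH (dual-joining).
joining = {0x0627: "R", 0x0628: "D"}
# File order: ALEF on the left, BEH on the right (the word BEH + ALEF).
boxes = [Box("\u0627", 10, 5, 14, 30, 0), Box("\u0628", 14, 5, 30, 20, 0)]
result = merge_boxes(boxes, joining)
```

In this sketch, a character that cannot connect forward (such as ALEF) ends the current run, so the next entry starts a new box.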

Installation

Simple steps for compiling and installing box_merge:

./configure
make
make install

Contributors

Please report bugs and suggest improvements in the issues section.

If you wish to contribute to this project, please send an email to soroush@ametisco.ir.
