box_merge is a utility for the Tesseract OCR engine training process. It merges boxes of Arabic script according to their joining class as specified in the Unicode Standard.
Call box_merge with these arguments:
box_merge -i INPUT_BOX_FILE -o OUTPUT_FILE -u ARABIC_SHAPING
INPUT_BOX_FILE
The input box file.
OUTPUT_FILE
The output box file.
ARABIC_SHAPING
This file is part of the Unicode Character Database (UCD). It describes the joining classes of all characters in the ranges 9-3, 9-8, 9-9, 9-10, 9-14, 9-15, 9-16, 9-19, 9-20, 10-4, 10-5, 10-6, 10-7, and 19-5 of The Unicode Standard core specification. The file (usually named ArabicShaping.txt) is available in the UCD.
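As an illustration of what this file contains, here is a minimal Python sketch that reads data lines in the ArabicShaping.txt layout (code point; schematic name; joining type; joining group) into a lookup table. It is not taken from box_merge itself; the function name parse_arabic_shaping is hypothetical, and the sample lines are a short excerpt in the file's format rather than the whole file.

```python
def parse_arabic_shaping(lines):
    """Map code point -> joining type letter (for example 'D', 'R', 'U').

    Hypothetical helper for illustration; not part of box_merge.
    """
    joining = {}
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3:
            continue  # not a data line
        # Field 0 is the code point in hex, field 2 is the joining type.
        joining[int(fields[0], 16)] = fields[2]
    return joining


sample = [
    "# Excerpt in the ArabicShaping.txt layout",
    "0621; HAMZA; U; No_Joining_Group",
    "0627; ALEF; R; ALEF",
    "0628; BEH; D; BEH",
]
table = parse_arabic_shaping(sample)
print(table[0x0628])  # prints "D" (dual-joining)
```

To load the real file, the same function can be given an open file object, since it only iterates over lines.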
Tesseract OCR is an excellent OCR engine that supports custom languages and fonts and provides tools for generating trained data. Unfortunately, the Tesseract training tools fall short for complex scripts like Arabic. By its nature, Arabic is written from right to left and follows a joining mechanism. Ignoring these features leaves us with an insufficient toolset for training Tesseract OCR data for Persian, Arabic, Indic, etc.
This project provides a tool to enhance .box files generated by text2image. The following changes are applied when running box_merge:
- The order of box entries is reversed to meet the needs of RTL languages (essentially a prerequisite for correct merging);
- Box definition entries in the input file are merged together following the joining classes of Unicode Standard Annex #44.
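The two steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not box_merge's actual implementation. It assumes that each box line has the form "glyph left bottom right top page", that the input entries arrive in visual left-to-right order so that reversing them yields reading order, and that a character joins the next one when its joining type is D, C or L and the next one's type is D, C or R. Transparent marks are ignored, and the names Box, parse_box_line, joins and merge_boxes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    # Split from the right so the glyph field is kept whole.
    text, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(text, int(left), int(bottom), int(right), int(top), int(page))


def joins(prev_char, next_char, joining):
    # Simplified rule (assumption): characters missing from the table
    # are treated as non-joining ("U").
    prev_type = joining.get(ord(prev_char), "U")
    next_type = joining.get(ord(next_char), "U")
    return prev_type in ("D", "C", "L") and next_type in ("D", "C", "R")


def merge_boxes(boxes, joining):
    merged = []
    for box in reversed(boxes):  # step 1: reverse the entry order
        last = merged[-1] if merged else None
        if last and last.page == box.page and joins(last.text[-1], box.text[0], joining):
            # Step 2: merge into one entry whose rectangle covers both boxes.
            merged[-1] = Box(
                last.text + box.text,
                min(last.left, box.left),
                min(last.bottom, box.bottom),
                max(last.right, box.right),
                max(last.top, box.top),
                box.page,
            )
        else:
            merged.append(box)
    return merged


# Joining types for BEH, ALEF and DAL, as listed in ArabicShaping.txt.
joining = {0x0628: "D", 0x0627: "R", 0x062F: "R"}
lines = [
    "\u062f 10 5 20 30 0",  # DAL, leftmost on the page
    "\u0627 22 5 26 40 0",  # ALEF
    "\u0628 26 5 40 20 0",  # BEH, rightmost on the page
]
result = merge_boxes([parse_box_line(line) for line in lines], joining)
for box in result:
    print(box)
```

In this example BEH joins the following ALEF, so the two boxes become one entry spanning 22 to 40 horizontally, while DAL stays separate because ALEF does not join to its left.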
Simple steps for compiling and installing box_merge:
./configure
make
make install
Please report bugs and suggest improvements in the issues section.
In case you wish to contribute to this project, please send an email to soroush@ametisco.ir.
This software is licensed under the GNU General Public License, version 3. All rights reserved.