OCR-Engine
Overview
Thanks to open source, it's not really difficult to build a custom OCR engine. In this repository, we'll do the following:
- download and pre-process a dataset for model training
- train a text detection model and a text recognition model, respectively
- build your own custom OCR-Engine
- serve your model on the web
The Google OCR Service paper gives a hint of how to build an OCR engine. In this repository, we will leave the Direction ID, Script ID, and Layout Analysis parts empty.
Getting Started
1. Data Generation
First of all, we need to prepare training data. If you don't have good-quality data, you can generate your own. There are five steps to go.
A) Collect corpus.
Locate your corpus in the ./generate_data/texts/ directory. This corpus will be tokenized and rendered into the dataset images, so it is best to gather corpus from your target domain. We recommend preparing more than 1 MB of corpus as a .txt file.
B) Collect fonts.
Locate your font files in the ./generate_data/fonts/<lang>/ directory. The extension of the font files should be .otf or .ttf. Separate fonts by language; if your language is English, the <lang> folder can be en.
C) Generate line data.
We will generate line images like the one below, together with .pkl files that contain the location of every character in the image. One pkl file is created per image. Additionally, the combined ground truth data will be generated in the gt.pkl file.
This line data is the ingredient for making the paragraph dataset. (see step D))
> cd generate_data
> python run.py -i texts/my-corpus.txt \
-l ko -nd -c 10000 -f 200 \
-rs -w 20 -t 1 -bl 2 -rbl -k 1 -rk -na 2 \
--output_dir out/line
- -i : input corpus
- -l : language of fonts (a language-name folder in the generate_data/fonts directory)
- -c : number of lines to be used for generating data
- You can check all options in generate_data.py
+) If you add the --bbox option, you can visualize the bounding box of every character. The image samples below include bounding-box visualization. You shouldn't use this option for training data.
D) Merge line data into paragraphs.
To train the text detection model, we will merge the line data we generated above into paragraphs. You can use the merge_lines.py script in the generate_data directory.
> cd generate_data
> python merge_lines.py -o vertical -b out/line --width 2000 --height 1000 --min 1 --max 5
Then you will get paragraph data and the out/line/combined/merged_gt.pkl file shown below.
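As a quick sanity check, you can open the merged ground truth with Python's pickle module. The sketch below writes and reads back a small dict with an assumed layout (image name mapped to text and per-character boxes); the actual keys inside merged_gt.pkl may differ.

```python
import os
import pickle
import tempfile

# Hypothetical structure -- the real merged_gt.pkl may use different keys.
sample_gt = {
    "merged_0.jpg": {
        "text": "hello",
        # one (x1, y1, x2, y2) box per character (assumed layout)
        "char_boxes": [(0, 0, 10, 20), (10, 0, 20, 20)],
    }
}

path = os.path.join(tempfile.mkdtemp(), "merged_gt.pkl")
with open(path, "wb") as f:
    pickle.dump(sample_gt, f)

# Reading it back works the same way for the real file.
with open(path, "rb") as f:
    gt = pickle.load(f)

for image_name, anno in gt.items():
    print(image_name, anno["text"], len(anno["char_boxes"]))
```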
E) Crop word data.
To train the text recognition model, we will generate word data by cropping the paragraph data we made in step D).
> cd generate_data
> python crop_words.py --pickle out/line/combined/merged_gt.pkl \
  --image_dir out/line/combined --output_dir out/words
Then you get word-level cropped data. With the command above, the combined ground truth will be located in out/words/gt.pkl.
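Conceptually, cropping a word is just slicing the paragraph image by the word's bounding box. This toy sketch uses a nested list as a stand-in "image" to illustrate the slicing logic only; crop_words.py itself operates on real image files.

```python
def crop(image, box):
    """Slice a 2-D image (list of rows) by an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# A fake 10x5 "image" where each pixel encodes its coordinates.
image = [[y * 10 + x for x in range(10)] for y in range(5)]

word_box = (2, 1, 6, 3)  # hypothetical word bounding box
word_img = crop(image, word_box)
print(len(word_img), len(word_img[0]))  # 2 rows, 4 columns
```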
Okay, we've finished preparing the dataset for training.
2. Train Text Detection Model
There are several hyper-parameters for the text detection model in settings/default.yaml. We don't recommend editing them unless you understand what each element does.
> python train.py -m detector \
--data_path generate_data/out/line/combined/merged_gt.pkl \
--version 0 --batch_size 4 --learning_rate 5e-5 \
--max_epoch 100 --num_workers 4
To monitor the training progress, use tensorboard.
> tensorboard --logdir tb_logs
3. Train Text Recognition Model
The text recognizer also has some hyper-parameters. Thanks to deep-text-recognition-benchmark, it's really easy to swap the parts that make up the recognizer.
Modules
- Transformation: select the Transformation module [None | TPS].
- FeatureExtraction: select the FeatureExtraction module [VGG | RCNN | ResNet].
- SequenceModeling: select the SequenceModeling module [None | BiLSTM].
- Prediction: select the Prediction module [CTC | Attn].
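The four stages combine into one recognizer pipeline, conventionally named by joining the module choices (as in deep-text-recognition-benchmark). This small sketch validates a combination against the option lists above; the helper function itself is illustrative, not part of this repository.

```python
# Valid choices per stage, taken from the module list above.
CHOICES = {
    "Transformation": ["None", "TPS"],
    "FeatureExtraction": ["VGG", "RCNN", "ResNet"],
    "SequenceModeling": ["None", "BiLSTM"],
    "Prediction": ["CTC", "Attn"],
}

def pipeline_name(transformation, feature, sequence, prediction):
    """Validate a module combination and return its conventional short name."""
    for stage, choice in zip(CHOICES, (transformation, feature, sequence, prediction)):
        if choice not in CHOICES[stage]:
            raise ValueError(f"{choice!r} is not a valid {stage} module")
    return "-".join((transformation, feature, sequence, prediction))

print(pipeline_name("TPS", "ResNet", "BiLSTM", "Attn"))  # TPS-ResNet-BiLSTM-Attn
```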
> python train.py -m recognizer --data_path generate_data/out/words/gt.pkl \
--version 1 --batch_size 64 --learning_rate 1.0 --max_epoch 100 --num_workers 4
You need to train the model for more than 15k total iterations.

iterations per epoch = train_data_size / (batch_size * num_gpu)
total iterations = iterations per epoch * total epochs
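As a worked example of the formulas above, with the recognizer command's batch size of 64 (the dataset size and GPU count here are assumptions for illustration):

```python
# Assumed values -- substitute your own dataset size and GPU count.
train_data_size = 100_000
batch_size = 64
num_gpu = 1
total_epoch = 100

iters_per_epoch = train_data_size // (batch_size * num_gpu)
total_iters = iters_per_epoch * total_epoch
print(iters_per_epoch, total_iters)  # 1562 156200
assert total_iters > 15_000  # comfortably clears the recommended 15k iterations
```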
You can monitor the training progress with tensorboard as well.
> tensorboard --logdir tb_logs
In the log screenshot, accuracy is calculated from exact-match cases.
4. Serve OCR Engine with API
Okay, it's time to deploy your OCR-Engine. Before running the API server, let's adjust some hyper-parameters for the prediction stage. Decreasing each threshold works better for most test cases.
# settings/default.yaml
craft:
THRESHOLD_WORD: 0.4
THRESHOLD_CHARACTER: 0.4
THRESHOLD_AFFINITY: 0.2
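What a threshold does at prediction time can be sketched as a simple score filter: detections whose confidence falls below the threshold are dropped, so lowering it keeps more (but noisier) boxes. The boxes and scores below are made up for illustration.

```python
# Fake detections with (x1, y1, x2, y2) boxes and confidence scores.
detections = [
    {"box": (10, 10, 50, 30), "score": 0.85},
    {"box": (60, 10, 90, 30), "score": 0.45},
    {"box": (5, 40, 20, 55), "score": 0.30},
]

def filter_by_threshold(dets, threshold):
    """Keep only detections whose score meets the threshold."""
    return [d for d in dets if d["score"] >= threshold]

print(len(filter_by_threshold(detections, 0.5)))  # 1 box survives
print(len(filter_by_threshold(detections, 0.4)))  # 2 boxes survive
```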
Then start the API server with demo.py. Specify each checkpoint you trained via the parameters.
> python demo.py --host 127.0.0.1 --port 5000 \
--detector_ckpt <detector checkpoint path> \
--recognizer_ckpt <recognizer checkpoint path> \
--vocab vocab.txt
Your OCR-Engine server is now running.
You can send an API request using request.py.
> python request.py <img path>
Then you will get the recognized text and coordinates in the response.
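Parsing such a response is straightforward JSON handling. The payload shape below (a "results" list of text/box pairs) is an assumption for illustration; check request.py for the field names the server actually returns.

```python
import json

# Hypothetical response body -- the real API's field names may differ.
response_body = json.dumps({
    "results": [
        {"text": "Hello", "box": [10, 10, 80, 40]},
        {"text": "OCR", "box": [90, 10, 140, 40]},
    ]
})

results = json.loads(response_body)["results"]
for item in results:
    print(item["text"], item["box"])
```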