OCR-Engine
Overview
Thanks to open source, it's not really difficult to build a custom OCR engine. In this repository, we'll do the following:
- download and pre-process a dataset for model training
- train a text detection model and a text recognition model, respectively
- build your own custom OCR-Engine
- serve your model on the web
The Google OCR Service paper gives a hint of how to build an OCR engine. In this repository, we will leave the Direction ID, Script ID, and Layout Analysis parts empty.
Getting Started
1. Data Generation
First of all, we need to prepare training data. If you don't have good-quality data, you can generate your own. There are five steps to go.
A) Collect corpus.
Locate your corpus in the ./generate_data/texts/ directory. This corpus will be tokenized and rendered into the dataset images, so it is best to gather corpus from your target domain. We recommend preparing more than 1 MB of corpus as a .txt file.
B) Collect fonts.
Locate your font files in the ./generate_data/fonts/<lang>/ directory. The extension of the font files should be .otf or .ttf. Separate fonts by language; if your language is English, the <lang> folder can be en.
C) Generate line data.
We will generate line images like the one below, together with .pkl files that contain the location of every character in the image. One pkl file is created per image. Additionally, the combined ground truth data will be generated in the gt.pkl file.
This line data is the ingredient for making the paragraph dataset. (see step D))
> cd generate_data
> python run.py -i texts/my-corpus.txt \
-l ko -nd -c 10000 -f 200 \
-rs -w 20 -t 1 -bl 2 -rbl -k 1 -rk -na 2 \
--output_dir out/line
- -i : input corpus
- -l : language of fonts (a language-name folder in the generate_data/fonts directory)
- -c : number of lines to be used for generating data
- You can check all options in generate_data.py
+) If you add the --bbox option, you can visualize the bounding box of every character. The image samples below include bounding-box visualization. You shouldn't use this option for training data.
D) Merge line data into paragraphs.
To train the text detection model, we will merge the line data we generated above into paragraphs. You can use the merge_lines.py script in the generate_data directory.
> cd generate_data
> python merge_lines.py -o vertical -b out/line --width 2000 --height 1000 --min 1 --max 5
Then you will get paragraph data and the out/line/combined/merged_gt.pkl file shown below.
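As a quick sanity check, you can open the merged ground truth with Python's pickle module. The sketch below writes and reads back a small dict with an assumed layout (image name mapped to text and per-character boxes); the actual keys inside merged_gt.pkl may differ.

```python
import os
import pickle
import tempfile

# Hypothetical structure -- the real merged_gt.pkl may use different keys.
sample_gt = {
    "merged_0.jpg": {
        "text": "hello",
        # one (x1, y1, x2, y2) box per character (assumed layout)
        "char_boxes": [(0, 0, 10, 20), (10, 0, 20, 20)],
    }
}

path = os.path.join(tempfile.mkdtemp(), "merged_gt.pkl")
with open(path, "wb") as f:
    pickle.dump(sample_gt, f)

# Reading it back works the same way for the real file.
with open(path, "rb") as f:
    gt = pickle.load(f)

for image_name, anno in gt.items():
    print(image_name, anno["text"], len(anno["char_boxes"]))
```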
E) Crop word data.
To train the text recognition model, we will generate word data by cropping the paragraph data we made in step D).
> cd generate_data
> python crop_words.py --pickle out/line/combined/merged_gt.pkl \
  --image_dir out/line/combined --output_dir out/words
Then you get word-level cropped data. With the command above, the combined ground truth will be located in out/words/gt.pkl.
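Conceptually, cropping a word is just slicing the paragraph image by the word's bounding box. This toy sketch uses a nested list as a stand-in "image" to illustrate the slicing logic only; crop_words.py itself operates on real image files.

```python
def crop(image, box):
    """Slice a 2-D image (list of rows) by an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# A fake 10x5 "image" where each pixel encodes its coordinates.
image = [[y * 10 + x for x in range(10)] for y in range(5)]

word_box = (2, 1, 6, 3)  # hypothetical word bounding box
word_img = crop(image, word_box)
print(len(word_img), len(word_img[0]))  # 2 rows, 4 columns
```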
Okay, we've finished preparing the dataset for training.
2. Train Text Detection Model
There are several hyper-parameters for the text detection model in settings/default.yaml. We don't recommend editing them unless you understand what each element does.
> python train.py -m detector \
--data_path generate_data/out/line/combined/merged_gt.pkl \
--version 0 --batch_size 4 --learning_rate 5e-5 \
--max_epoch 100 --num_workers 4
To monitor the training progress, use tensorboard.
> tensorboard --logdir tb_logs
3. Train Text Recognition Model
The text recognizer also has some hyper-parameters. Thanks to deep-text-recognition-benchmark, it's really easy to swap the parts that make up the recognizer.
Modules
- Transformation: select the Transformation module [None | TPS].
- FeatureExtraction: select the FeatureExtraction module [VGG | RCNN | ResNet].
- SequenceModeling: select the SequenceModeling module [None | BiLSTM].
- Prediction: select the Prediction module [CTC | Attn].
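The four stages combine into one recognizer pipeline, conventionally named by joining the module choices (as in deep-text-recognition-benchmark). This small sketch validates a combination against the option lists above; the helper function itself is illustrative, not part of this repository.

```python
# Valid choices per stage, taken from the module list above.
CHOICES = {
    "Transformation": ["None", "TPS"],
    "FeatureExtraction": ["VGG", "RCNN", "ResNet"],
    "SequenceModeling": ["None", "BiLSTM"],
    "Prediction": ["CTC", "Attn"],
}

def pipeline_name(transformation, feature, sequence, prediction):
    """Validate a module combination and return its conventional short name."""
    for stage, choice in zip(CHOICES, (transformation, feature, sequence, prediction)):
        if choice not in CHOICES[stage]:
            raise ValueError(f"{choice!r} is not a valid {stage} module")
    return "-".join((transformation, feature, sequence, prediction))

print(pipeline_name("TPS", "ResNet", "BiLSTM", "Attn"))  # TPS-ResNet-BiLSTM-Attn
```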
> python train.py -m recognizer --data_path generate_data/out/words/gt.pkl \
--version 1 --batch_size 64 --learning_rate 1.0 --max_epoch 100 --num_workers 4
You need to train the model for more than 15k total iterations.

iterations per epoch = train_data_size / (batch_size * num_gpu)
total iterations = iterations per epoch * total epochs
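As a worked example of the formulas above, with the recognizer command's batch size of 64 (the dataset size and GPU count here are assumptions for illustration):

```python
# Assumed values -- substitute your own dataset size and GPU count.
train_data_size = 100_000
batch_size = 64
num_gpu = 1
total_epoch = 100

iters_per_epoch = train_data_size // (batch_size * num_gpu)
total_iters = iters_per_epoch * total_epoch
print(iters_per_epoch, total_iters)  # 1562 156200
assert total_iters > 15_000  # comfortably clears the recommended 15k iterations
```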
You can monitor the training progress with tensorboard as well.
> tensorboard --logdir tb_logs
In the log screenshot, accuracy is calculated from exact-match cases.
4. Serve OCR Engine with API
Okay, it's time to deploy your OCR-Engine. Before running the API server, let's adjust some hyper-parameters for the prediction stage. Decreasing each threshold works better for most test cases.
# settings/default.yaml
craft:
THRESHOLD_WORD: 0.4
THRESHOLD_CHARACTER: 0.4
THRESHOLD_AFFINITY: 0.2
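What a threshold does at prediction time can be sketched as a simple score filter: detections whose confidence falls below the threshold are dropped, so lowering it keeps more (but noisier) boxes. The boxes and scores below are made up for illustration.

```python
# Fake detections with (x1, y1, x2, y2) boxes and confidence scores.
detections = [
    {"box": (10, 10, 50, 30), "score": 0.85},
    {"box": (60, 10, 90, 30), "score": 0.45},
    {"box": (5, 40, 20, 55), "score": 0.30},
]

def filter_by_threshold(dets, threshold):
    """Keep only detections whose score meets the threshold."""
    return [d for d in dets if d["score"] >= threshold]

print(len(filter_by_threshold(detections, 0.5)))  # 1 box survives
print(len(filter_by_threshold(detections, 0.4)))  # 2 boxes survive
```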
Then start the API server with demo.py. Specify each checkpoint you trained via the parameters.
> python demo.py --host 127.0.0.1 --port 5000 \
--detector_ckpt <detector checkpoint path> \
--recognizer_ckpt <recognizer checkpoint path> \
--vocab vocab.txt
Your OCR-Engine server is now running.
You can send an API request using request.py.
> python request.py <img path>
Then you will get the recognized text and coordinates in the response.
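Parsing such a response is straightforward JSON handling. The payload shape below (a "results" list of text/box pairs) is an assumption for illustration; check request.py for the field names the server actually returns.

```python
import json

# Hypothetical response body -- the real API's field names may differ.
response_body = json.dumps({
    "results": [
        {"text": "Hello", "box": [10, 10, 80, 40]},
        {"text": "OCR", "box": [90, 10, 140, 40]},
    ]
})

results = json.loads(response_body)["results"]
for item in results:
    print(item["text"], item["box"])
```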