Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words as shown in the illustration below.
Go to the model/
directory and unzip the file model.zip
(pre-trained on the IAM dataset).
Take care that the unzipped files are placed directly into the model/
directory and not some subdirectory created by the unzip-program.
Afterwards, go to the src/
directory and run python main.py
.
The input image and the expected output is shown below.
> python main.py
Validation character error rate of saved model: 10.624916%
Init with stored values from ../model/snapshot-38
Recognized: "little"
Probability: 0.96625507
Tested with:
- Python 2 and Python 3
- TF 1.3, 1.10 and 1.12
--train
: train the NN, details see below.--validate
: validate the NN, details see below.--beamsearch
: use vanilla beam search decoding (better, but slower) instead of best path decoding.--wordbeamsearch
: use word beam search decoding (only outputs words contained in a dictionary) instead of best path decoding.
The model [1] is a stripped-down version of the HTR system I implemented for my thesis [2][3]. What remains is what I think is the bare minimum to recognize text with an acceptable accuracy. The implementation only depends on numpy, cv2 and tensorflow imports. It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. The illustration below gives an overview of the NN (green: operations, pink: data flowing through NN) and here follows a short description:
- The input image is a gray-value image and has a size of 128x32
- 5 CNN layers map the input image to a feature sequence of size 32x256
- 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
- The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
- Batch size is set to 50