pratikanubhav / IWR

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Handwritten Text Recognition with TensorFlow

Handwritten Text Recognition (HTR) system implemented with TensorFlow (TF) and trained on the IAM off-line HTR dataset. This Neural Network (NN) model recognizes the text contained in the images of segmented words as shown in the illustration below.

Run demo

Go to the model/ directory and unzip the file model.zip (pre-trained on the IAM dataset). Take care that the unzipped files are placed directly into the model/ directory and not some subdirectory created by the unzip-program. Afterwards, go to the src/ directory and run python main.py. The input image and the expected output is shown below.

> python main.py
Validation character error rate of saved model: 10.624916%
Init with stored values from ../model/snapshot-38
Recognized: "little"
Probability: 0.96625507

Tested with:

  • Python 2 and Python 3
  • TF 1.3, 1.10 and 1.12

Command line arguments

  • --train: train the NN, details see below.
  • --validate: validate the NN, details see below.
  • --beamsearch: use vanilla beam search decoding (better, but slower) instead of best path decoding.
  • --wordbeamsearch: use word beam search decoding (only outputs words contained in a dictionary) instead of best path decoding.

Information about model

Overview

The model [1] is a stripped-down version of the HTR system I implemented for my thesis [2][3]. What remains is what I think is the bare minimum to recognize text with an acceptable accuracy. The implementation only depends on numpy, cv2 and tensorflow imports. It consists of 5 CNN layers, 2 RNN (LSTM) layers and the CTC loss and decoding layer. The illustration below gives an overview of the NN (green: operations, pink: data flowing through NN) and here follows a short description:

  • The input image is a gray-value image and has a size of 128x32
  • 5 CNN layers map the input image to a feature sequence of size 32x256
  • 2 LSTM layers with 256 units propagate information through the sequence and map the sequence to a matrix of size 32x80. Each matrix-element represents a score for one of the 80 characters at one of the 32 time-steps
  • The CTC layer either calculates the loss value given the matrix and the ground-truth text (when training), or it decodes the matrix to the final text with best path decoding or beam search decoding (when inferring)
  • Batch size is set to 50

About

License:MIT License


Languages

Language:Python 100.0%