This repository applies the architecture proposed in "What You Get Is What You See: A Visual Markup Decompiler" (http://arxiv.org/pdf/1609.04938v1.pdf) to the problem of Handwriting Recognition. The base implementation was done in TensorFlow by ritheshkumar95/im2latex-tensorflow (forked) and was modified for the handwriting task. The original Torch implementation of the paper is located here: https://github.com/harvardnlp/im2markup/blob/master/
What You Get Is What You See: A Visual Markup Decompiler
Yuntian Deng, Anssi Kanervisto, and Alexander M. Rush
http://arxiv.org/pdf/1609.04938v1.pdf
This deep learning framework learns a representation of an image. In our case, the input is an image of handwritten text, which the model converts to an ASCII representation.
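The core mechanism from the paper is a CNN encoder that turns the image into a grid of feature vectors, over which an attention-equipped LSTM decoder attends at each output step. The sketch below is an illustration of that attention step only (not the repo's actual code); the grid and channel sizes mirror the defaults listed later in this README.

```python
import numpy as np

def attention_context(features, query):
    """One attention step: score each feature-grid cell against the decoder
    state, softmax the scores, and return the weighted-sum context vector.

    features: (H*W, D) flattened CNN feature grid
    query:    (D,) projection of the current decoder state
    """
    scores = features @ query                    # (H*W,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over grid cells
    context = weights @ features                 # (D,) context vector
    return context, weights

# Toy example with the grid dimensions used elsewhere in this README:
H, W, D = 20, 50, 512
feats = np.random.randn(H * W, D)
query = np.random.randn(D)
ctx, w = attention_context(feats, query)
```

The decoder consumes `ctx` together with the previous character embedding to predict the next output character.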
For example, given an input image of handwritten text, the goal is to infer the corresponding ASCII text, e.g.:
MOVE
- attention.py: file that is run for training and testing
- data_loaders: file that is called by attention.py to load data files
- tflib/: contains network.py and ops.py, which implement the CNN and LSTM architectures in TensorFlow
- scripts/: contains scripts needed to preprocess data
- images/: contains image data
- baseline_model/: contains code from our baseline and milestone models
- att_imgs/: contains images with a visualization of attention
We obtained our dataset from the IAM Handwriting Database 3.0 (http://www.fki.inf.unibe.ch/databases/iam-handwriting-database/download-the-iam-handwriting-database). A sample of these images and the directory structure is included in this repo in the images folder. Follow the steps below to preprocess the image data.
- Download the words dataset from the IAM Handwriting Database and place the words.txt file in the data folder.
- Run the parse raw data script and place the images_path_label.csv file that it creates in the images folder.
python scripts/parse_raw_data.py images/data/words.txt
- Resize all images to have a width of 120 pixels and a height of 50 pixels.
python scripts/resize_images.py images/images_path_label.csv images/
- Preprocess images by cropping out whitespace
python scripts/preprocessing/preprocess_images_handwriting.py --input-dir images/data --output-dir images/processed
- Create a labels file called labels.norm.lst that contains pipe ("|")-separated characters of the ASCII conversion of the corresponding image in images_path_label.csv.
python scripts/preprocessing/preprocess_labels_handwriting.py images/image_path_file.csv images/
- Filter images into train.lst, test.lst, and valid.lst, and move these files to images/.
python scripts/preprocessing/preprocess_filter_handwriting.py
- Lastly, create train, test, and valid buckets to be read from during training.
python scripts/preprocessing/create_buckets.py train
python scripts/preprocessing/create_buckets.py test
python scripts/preprocessing/create_buckets.py valid
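The label step above stores each transcription as pipe-separated characters. A minimal sketch of that conversion (assuming plain ASCII labels; the actual preprocessing script may handle more cases):

```python
def to_pipe_label(text):
    """Join the characters of an ASCII label with '|', as in labels.norm.lst."""
    return "|".join(text)

print(to_pipe_label("MOVE"))  # → M|O|V|E
```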
Now, we are finally ready to train our model. You can do this by running:
python attention.py
Default hyperparameters used:
- BATCH_SIZE = 16
- EMB_DIM = 60
- ENC_DIM = 256
- DEC_DIM = ENC_DIM*2
- D = 512 (#channels in feature grid)
- V = 502 (vocab size)
- NB_EPOCHS = 50
- H = 20 (Maximum height of feature grid)
- W = 50 (Maximum width of feature grid)
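The defaults above, collected into a single Python mapping for reference (DEC_DIM is derived from ENC_DIM exactly as in the list):

```python
# Default hyperparameters, as listed in this README.
ENC_DIM = 256
HYPERPARAMS = {
    "BATCH_SIZE": 16,
    "EMB_DIM": 60,
    "ENC_DIM": ENC_DIM,
    "DEC_DIM": ENC_DIM * 2,  # 512
    "D": 512,                # channels in the CNN feature grid
    "V": 502,                # vocabulary size
    "NB_EPOCHS": 50,
    "H": 20,                 # maximum height of the feature grid
    "W": 50,                 # maximum width of the feature grid
}
```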
You can use the following flags to set additional hyperparameters:
- --lr: learning rate
- --decay_rate: decay rate
- --num_epochs: number of epochs
- --num_iterations: number of iterations
- --optimizer: type of optimizer (sgd, adam, rmsprop)
- --batch_size: batch size
- --embedding_size: embedding size
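A hypothetical sketch of how these flags could map onto an argparse parser (attention.py's real argument handling and default values may differ; the defaults shown here are assumptions):

```python
import argparse

def build_parser():
    """Sketch of a CLI parser for the flags listed above (not the repo's actual code)."""
    p = argparse.ArgumentParser(description="Train the handwriting recognition model")
    p.add_argument("--lr", type=float, default=0.001, help="learning rate")
    p.add_argument("--decay_rate", type=float, default=0.95, help="decay rate")
    p.add_argument("--num_epochs", type=int, default=50, help="number of epochs")
    p.add_argument("--num_iterations", type=int, default=None, help="number of iterations")
    p.add_argument("--optimizer", choices=["sgd", "adam", "rmsprop"], default="adam",
                   help="type of optimizer")
    p.add_argument("--batch_size", type=int, default=16, help="batch size")
    p.add_argument("--embedding_size", type=int, default=60, help="embedding size")
    return p

# Example: override the learning rate and optimizer from the command line.
args = build_parser().parse_args(["--lr", "0.01", "--optimizer", "sgd"])
```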
The predict() function in the attention.py script can be called to generate predictions on the validation or test sets. If you call this function with visualization turned on, it saves images indicating where attention was placed for each output character.
"m"
"i"
"g"
"h"
"t"
"#END”
The code for our baseline and milestone models can be found in the folder baseline_model.