dashayushman / air-script-seq

A CNN + Sequence to Sequence model for detecting handwriting on air

Air-Script is a CNN + Sequence to Sequence model for detecting handwriting on air using a Myo-Armband. It is inspired by ‘Recursive Recurrent Nets with Attention Modeling for OCR in the Wild’ by Chen-Yu & Simon, 2016. The idea is to use 1D-CNNs as feature extractors and a sequence to sequence model with the attention mechanism introduced by Bahdanau et al., 2014, built on LSTMs, for variable-length sequence classification.

The implementation of Attention-OCR was extremely helpful, and Air-Script was built upon it. This project does not yet give the expected results and is currently under development. Probable issues have been tracked, and a list of further tasks is given at the end of this document; it is updated regularly.

Prerequisites

  1. Tensorflow (Version 0.11.0)
  2. Keras (Version 1.1.1)
  3. Distance (Optional)
  4. Python 2.7

I have tested it on Ubuntu 14.04 and 15.04 with an NVIDIA GeForce GT 740M and an NVIDIA TITAN X graphics card, with Tensorflow running in a virtual environment. It should run smoothly on any other system with the above packages installed.
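
A quick sanity check of the environment (a minimal sketch; it only assumes that Tensorflow and Keras are importable from the active virtual environment):

# check_env.py -- print the installed framework versions
import tensorflow as tf
import keras

print("TensorFlow: " + tf.__version__)  # expected: 0.11.0
print("Keras: " + keras.__version__)    # expected: 1.1.1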

Set Keras backend:

export KERAS_BACKEND=tensorflow
echo 'export KERAS_BACKEND=tensorflow' >> ~/.bashrc

Install Distance (Optional):

wget http://www.cs.cmu.edu/~yuntiand/Distance-0.1.3.tar.gz
tar zxf Distance-0.1.3.tar.gz
cd distance; sudo python setup.py install
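
Distance is only needed for evaluation, to measure the edit distance between predicted and ground-truth sequences. A minimal usage sketch, assuming the package installed above:

# Character-level edit distance between a prediction and the ground truth
import distance

prediction = "10289"
ground_truth = "100289"
print(distance.levenshtein(prediction, ground_truth))  # -> 1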

The Idea

The idea was to learn features using a 1D Convolutional Neural Network and then align input sequences (raw IMU signals from the Myo-Armband) with output sequences (sequences of characters) using a sequence to sequence model (Sutskever et al., 2014) with an attention mechanism. Since our dataset had a limited amount of data to train this network, artificially generated datasets (Appendix 1) of various sizes were used to train it.

Model Architecture

[Figure 1] The high-level model architecture, showing the encoder and the decoder with attention

Components

The model consists of the following components

  1. Encoder (1D-CNN): Like the encoder of an auto-encoder, it encodes the input sequence into a feature vector, which is then decoded into the output sequence.

  2. Decoder (LSTM Network + Attention): A stacked LSTM network with attention was used to decode the feature vector into an output sequence. Both the feature vector and the output sequence were padded for alignment.
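
For illustration, a minimal sketch of this encoder-decoder layout using the modern tf.keras API is given below. The repository itself targets Tensorflow 0.11 / Keras 1.1.1, so the layer sizes and the names imu_channels and num_chars are assumptions, not the project's actual code.

import tensorflow as tf
from tensorflow.keras import layers, Model

imu_channels = 8    # assumed number of IMU channels per timestep
num_chars = 13      # assumed output vocabulary size (digits + GO/EOS/PAD)
max_in_len = 1900   # largest input bucket
max_out_len = 13    # largest output bucket

# Encoder: 1D-CNN over the raw IMU signal
enc_in = layers.Input(shape=(max_in_len, imu_channels))
x = layers.Conv1D(64, 5, strides=2, activation="relu")(enc_in)
x = layers.Conv1D(128, 5, strides=2, activation="relu")(x)
enc_out = layers.Conv1D(128, 3, strides=2, activation="relu")(x)  # (batch, T', 128)

# Decoder: stacked LSTMs with additive (Bahdanau-style) attention, teacher-forced
dec_in = layers.Input(shape=(max_out_len,))
h = layers.Embedding(num_chars, 64)(dec_in)
h = layers.LSTM(128, return_sequences=True)(h)
h = layers.LSTM(128, return_sequences=True)(h)
context = layers.AdditiveAttention()([h, enc_out])  # attend over the CNN features
dec_out = layers.Dense(num_chars, activation="softmax")(layers.Concatenate()([h, context]))

model = Model([enc_in, dec_in], dec_out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")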

CNN Model Architecture and specifications

[Figure] The CNN architecture

Bucketing

A bucketing technique has been used for padding variable-length input and output sequences. The bucket specs, given as maximum input sequence length : maximum output sequence length, are as follows:

  • 400 : 4
  • 800 : 8
  • 1200 : 9
  • 1800 : 11
  • 1900 : 13

These buckets were selected by analyzing the distributions of the input and output sequence lengths. Bucket sizes that made the input sequence length distribution uniform were chosen, keeping in mind how many input timesteps typically correspond to an output sequence of a given length. The idea was not only to use the CNN to learn to extract features from the sensor data but also to fuse the sensor data hierarchically.
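
As a rough illustration of how such bucketing can be applied, here is a minimal sketch (the padding values and the function name put_in_bucket are assumptions, not the project's code):

import numpy as np

# (max input timesteps, max output characters), as listed above
BUCKETS = [(400, 4), (800, 8), (1200, 9), (1800, 11), (1900, 13)]
PAD_ID = 0  # assumed padding symbol in the output vocabulary

def put_in_bucket(signal, label_ids):
    """Pick the smallest bucket that fits, then pad both sequences to its size."""
    for in_len, out_len in BUCKETS:
        if len(signal) <= in_len and len(label_ids) <= out_len:
            padded_signal = np.zeros((in_len, signal.shape[1]), dtype=signal.dtype)
            padded_signal[:len(signal)] = signal
            padded_labels = label_ids + [PAD_ID] * (out_len - len(label_ids))
            return (in_len, out_len), padded_signal, padded_labels
    raise ValueError("sequence does not fit in any bucket")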

The different model hyperparameters used for the experiments, with the corresponding dataset specs:

| Parameter | Model 1 | Model 2 |
| --- | --- | --- |
| Min. output sequence length | 1 | 1 |
| Max. output sequence length | 10 | 10 |
| Number of training instances | 100,000 | 100,000 |
| Number of testing instances | 1,000 | 1,000 |
| Batch size | 64 | 64 |
| LSTM layers | 3 | 2 |
| Initial learning rate | 0.0001 | 0.001 |
| LSTM hidden units | 128 | 128 |
| Optimizer | ADADelta | ADAM |
| Epochs | approx. 38 (55,800 iterations) | approx. 10 (15,600 iterations) |

Other models were also trained and tested with slight variations in the hyperparameters and with smaller and larger datasets of 1,000 and 1,000,000 instances.
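
For reference, the optimizer settings from the table could be expressed in modern Keras roughly as follows (an assumption of the equivalent calls; the repository itself uses Tensorflow 0.11's own optimizers):

from tensorflow.keras import optimizers

opt_model_1 = optimizers.Adadelta(learning_rate=0.0001)  # Model 1: 3 LSTM layers
opt_model_2 = optimizers.Adam(learning_rate=0.001)       # Model 2: 2 LSTM layers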

Data Preparation

The data used for training and evaluation was acquired using the Myo-Armband and Pewter, and the data creator module was used to process it.

Results

The results are not as expected. The model overfits, though it may be improvable with a few tweaks. It does learn, and the perplexity decreases during training, but the predictions are still poor.

Loss

Loss curve

Perplexity

Perplexity curve

Output

[Figure 2] Each result is shown in two parts: the left-hand side shows the input data sequence; the right-hand side shows the heatmap of the attention vector over the input sequence, with the predicted output sequence and the ground-truth sequence in the title

Conclusion

It is evident from the results that they are not good. The reasons could be many, and I am still working on fixing them. Some issues are obvious and some need a lot of experimentation.

Probable Solutions to Issues

  1. Try different CNN architectures by fusing the sensor data at different levels and changing the sizes of the layers and filters.

  2. Pre-train the CNN with existing datasets available for gesture classification and then use it in the above model.

  3. Replace CNN with MFCC features and directly apply the Seq2Seq model with attention.

  4. Preprocess the data before encoding.

  5. Replace CNN with BLSTM as an encoder.

  6. Visualize CNN features using t-SNE and check if they actually make sense or not.

Appendix

1. Data Generation

Artificial datasets were generated as follows:

  1. Generating random output sequences from the given labels, e.g. “100289”.
  2. For every label (character) in the generated output sequence, a random data instance (input sequence for a character) was picked from the original dataset corresponding to the same label.
  3. These randomly picked data instances were concatenated to form an input sequence.

The artificial datasets consisted of such input and output sequences.

No preprocessing was done on the generated data and output sequences of minimum length 1 and maximum length 10 were generated.
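
A minimal sketch of this generation procedure (the function name and dataset_by_label, which maps each character label to its list of recorded input sequences, are assumptions):

import random
import numpy as np

def generate_instance(dataset_by_label, labels, min_len=1, max_len=10):
    """Build one artificial (input sequence, output sequence) pair."""
    # 1. Random output sequence, e.g. "100289"
    out_seq = [random.choice(labels) for _ in range(random.randint(min_len, max_len))]
    # 2. For every character, pick a random recorded instance with the same label
    picked = [random.choice(dataset_by_label[c]) for c in out_seq]
    # 3. Concatenate the picked instances along the time axis to form the input sequence
    return np.concatenate(picked, axis=0), out_seq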

References

Tensorflow

Keras

Distance

Attention-OCR

Recursive Recurrent Nets with Attention Modeling for OCR in the Wild

Neural Machine Translation by Jointly Learning to Align and Translate

License: MIT License

