grammar neural-network keras grammar-correction

Grammar correction with neural network

Seq2Seq (Encoder-Decoder) with Attention Mechanism for Grammar Correction in Keras.

How it works

First, we need to create a parallel dataset for training the model. In the preprocess folder, the lang8 and nucle modules convert each dataset into the proper format. The Lang8 dataset is very noisy, so I decided to do some light preprocessing on it: I remove non-ASCII characters, collapse any run of three or more repeated characters into a single character (e.g. a token like !!!!!!! becomes !), and remove unnecessary punctuation (all punctuation except {',','.','-'}).
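The three cleaning rules can be sketched roughly as follows. This is a minimal illustration in Python 3, not the code from the preprocess folder: the name clean_line and the order in which the rules are applied are assumptions.

```python
import re
import string

# Punctuation marks that survive cleaning, as listed in the text above.
KEEP = {",", ".", "-"}

# Translation table that deletes every other ASCII punctuation mark.
_DROP_TABLE = str.maketrans(
    "", "", "".join(ch for ch in string.punctuation if ch not in KEEP)
)

# Any character repeated three or more times in a row.
_REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean_line(text):
    """Hypothetical helper: apply the three cleaning rules to one sentence."""
    # 1. Drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # 2. Collapse runs of 3+ identical characters into a single character.
    text = _REPEAT_RE.sub(r"\1", text)
    # 3. Remove punctuation other than ',', '.' and '-'.
    text = text.translate(_DROP_TABLE)
    # Normalise whitespace left behind by the removals.
    return " ".join(text.split())
```

Note that collapsing repeated characters also shortens intentionally stretched words (cooool becomes col), which is acceptable for noisy learner text but worth keeping in mind.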

At the final preprocessing step, I do some data augmentation. For each (source, target) pair, in addition to the existing errors, I inject some typos and grammatical errors into the source sample. This step includes:

Dropout token
Modal replacement
Misspelling tokens
Change tense of verbs
Change singularity/plurality of nouns
Change preposition

According to this paper, the cases above cover most of the errors in English learners' writing.
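A few of these noise operations can be sketched in plain Python. This is only an illustration under assumptions: the word lists, the function names and the probability p are hypothetical, and tense and number changes are left out because they need a morphological tool.

```python
import random

# Small illustrative word lists (assumptions, not the repository's lists).
PREPOSITIONS = ["in", "on", "at", "for", "to", "of", "with"]
MODALS = ["can", "could", "may", "might", "must", "should", "will", "would"]


def swap_adjacent_chars(token, rng):
    """Misspell a token by swapping two inner characters."""
    if len(token) < 4:
        return token  # too short to swap without touching the edges
    i = rng.randrange(1, len(token) - 2)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def replace_from(token, choices, rng):
    """Pick a different word from the same closed class."""
    return rng.choice([c for c in choices if c != token.lower()])


def inject_errors(tokens, rng, p=0.1):
    """Corrupt each source token with probability p."""
    noisy = []
    for tok in tokens:
        if rng.random() >= p:
            noisy.append(tok)  # leave the token untouched
        elif tok.lower() in PREPOSITIONS:
            noisy.append(replace_from(tok, PREPOSITIONS, rng))
        elif tok.lower() in MODALS:
            noisy.append(replace_from(tok, MODALS, rng))
        elif rng.random() < 0.5:
            continue  # token dropout
        else:
            noisy.append(swap_adjacent_chars(tok, rng))
    return noisy


rng = random.Random(0)  # fixed seed so the noise is reproducible
print(inject_errors("she can swim in the water".split(), rng, p=0.5))
```

Only the source side of each pair is corrupted; the target stays as the clean reference.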

In the training step, I used the well-known seq2seq model with attention. The best hyper-parameters for seq2seq were explored by the team at Google in the paper "Massive Exploration of Neural Machine Translation Architectures". I used a one-layer encoder/decoder to keep things as simple as possible. It can easily be extended to a 4-layer encoder/decoder framework (taking regularization and dropout into account).
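To make the attention step concrete, here is a small dependency-free sketch of what happens at one decoder time step. It assumes dot-product scoring, which may differ from the scoring function used in the actual Keras model, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    decoder_state: list of floats, the current decoder hidden state.
    encoder_states: list of such lists, one per source token.
    """
    # Score each source position against the decoder state (dot product).
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


weights, context = attend([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(weights, context)
```

In a full model the context vector is combined with the decoder state before predicting the next target token, so the decoder can look back at the source positions that matter for the current correction.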

How to use

git clone https://github.com/hadifar/GrammarCorrection.git
cd GrammarCorrection
virtualenv venv
source venv/bin/activate
sudo pip2 install -r requirements.txt
mkdir data
cd data
# Download the lang8 and NUCLE datasets and put them in the data folder.
cd ..
cd preprocess
sh preprocess_script.sh
cd ..
cd models
# Download the fastText pretrained embeddings and put them in the data/embedding folder.
sh train_script.sh
That's all :)
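If you want to inspect the pretrained embeddings before training, the sketch below reads the fastText text format (.vec), where the first line holds the vocabulary size and dimension and each following line holds a word and its vector. It assumes you downloaded the text-format file; load_vec is a hypothetical helper, not part of the training script.

```python
import io


def load_vec(handle, vocab=None):
    """Read fastText .vec text from an open file handle.

    If vocab is given, only words in it are kept, which saves memory.
    Returns (dimension, {word: [floats]}).
    """
    header = handle.readline().split()
    dim = int(header[1])  # header is "<word count> <dimension>"
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny in-memory example instead of a real embedding file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat 1.0 2.0 3.0\n")
dim, vectors = load_vec(sample, vocab={"cat"})
print(dim, vectors)
```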

How to use in Colab

See riminder.ipynb in the root directory. (Do not forget to put the dataset in your Google Drive.)

TODO:

Use character n-gram features
Use a language model to check the final output
