This repository provides several improvements to the state-of-the-art sequence tagging model for grammatical error correction described in the following thesis:
The code in this repository is mainly based on the official implementation from the following paper:
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested using Python 3.7.
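For example, a minimal and purely illustrative setup inside a virtual environment (the environment name is arbitrary) could look like:

python3.7 -m venv gec-env
source gec-env/bin/activate
pip install -r requirements.txt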
All the public GEC datasets used in the thesis can be downloaded from here.
Knowledge distilled datasets can be downloaded here.
Synthetic datasets created with PIE can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
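For example, with hypothetical parallel files train_src.txt (original sentences) and train_tgt.txt (corrected sentences), the call could be:

python utils/preprocess_data.py -s train_src.txt -t train_tgt.txt -o train_preprocessed.txt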
All available pretrained models can be downloaded here.
Pretrained encoder | BEA-2019 (test)
--- | ---
RoBERTa [link] | 73.1
Large RoBERTa voc10k + DeBERTa voc10k + XLNet voc 5k [link] | 76.05
To train the model, simply run:
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
--model_dir MODEL_DIR
There are many parameters to specify; among them (an example invocation is shown after the list):
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert, distilbert, gpt2, roberta, transformerxl, xlnet, albert} - model encoder
- `tn_prob` - probability of getting sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - maximum number of subwords per token; helps to avoid running out of CUDA memory
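As a sketch only, assuming the command-line flags mirror the parameter names above and using placeholder file names and values, a training run could look like:

python train.py --train_set train_preprocessed.txt --dev_set dev_preprocessed.txt \
                --model_dir model_output \
                --transformer_model roberta \
                --cold_steps_count 4 \
                --tn_prob 0.2 \
                --pieces_per_token 5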
In our experiments we used a 98/2 train/dev split.
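As an illustration of such a split (file names are placeholders; any equivalent tooling works), the last 2% of lines of the preprocessed file can be held out as a dev set:

total=$(wc -l < train_preprocessed.txt)
dev_size=$((total * 2 / 100))
head -n $((total - dev_size)) train_preprocessed.txt > train_split.txt
tail -n $dev_size train_preprocessed.txt > dev_split.txt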
We describe all the parameters that we use for training and evaluation here.
To run your model on an input file, use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
Among the parameters (an example invocation is shown after the list):
- `min_error_probability` - minimum error probability (as in the paper)
- `additional_confidence` - confidence bias (as in the paper)
- `special_tokens_fix` - needed to reproduce some of the reported results of the pretrained models
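For example, assuming the flags mirror the parameter names above (file names and values are placeholders), an ensemble prediction could be run as:

python predict.py --model_path model1.th model2.th \
                  --vocab_path VOCAB_PATH \
                  --input_file input.txt --output_file predictions.txt \
                  --min_error_probability 0.5 \
                  --additional_confidence 0.2 \
                  --special_tokens_fix 1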
For evaluation we use ERRANT.
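For reference, a typical ERRANT evaluation (file names are placeholders) first aligns the original and corrected sentences into M2 format and then scores the hypothesis against a reference:

errant_parallel -orig input.txt -cor predictions.txt -out hyp.m2
errant_compare -hyp hyp.m2 -ref ref.m2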