Jason3900 / gector-large

Improved version of GECToR

Improving Sequence Tagging for Grammatical Error Correction

This repository provides several improvements to the state-of-the-art sequence tagging model for grammatical error correction described in the following thesis:

Improving Sequence Tagging for Grammatical Error Correction

The code in this repository is mainly based on the official implementation of the following paper:

GECToR – Grammatical Error Correction: Tag, Not Rewrite

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

Datasets

All the public GEC datasets used in the thesis can be downloaded from here.
Knowledge-distilled datasets can be downloaded here.
Synthetic datasets created with PIE can be generated/downloaded here.

To train the model, the data has to be preprocessed and converted to a special format with the command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
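The SOURCE and TARGET files are parallel corpora aligned line by line. As a small illustrative sketch (the file names and example sentence are hypothetical, not part of the repository), a toy pair could be prepared like this:

```python
from pathlib import Path

# Illustrative parallel data: one sentence per line, whitespace-tokenized,
# with source line i aligned to target line i.
Path("train.src").write_text("She like playing tennis .\n")
Path("train.tgt").write_text("She likes playing tennis .\n")

# The repository's preprocessor would then be invoked as:
#   python utils/preprocess_data.py -s train.src -t train.tgt -o train.labeled
```

The preprocessor aligns each source/target pair and emits token-level edit tags used for training.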

Pretrained models

All available pretrained models can be downloaded here.

| Pretrained encoder | BEA-2019 (test) |
| --- | --- |
| RoBERTa [link] | 73.1 |
| Large RoBERTa voc10k + DeBERTa voc10k + XLNet voc 5k [link] | 76.05 |
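The second row is an ensemble of several encoders. As a minimal sketch of one common ensembling scheme (averaging per-token tag distributions; the actual combination logic in `predict.py` may differ, and the tag names below are just the GECToR-style `$KEEP`/`$DELETE` examples):

```python
def ensemble_tag_probs(model_probs):
    """Average per-token tag distributions from several models.

    model_probs: list over models; each element is a list over tokens of
    {tag: probability} dicts. Tags missing from a model contribute 0.
    """
    n_models = len(model_probs)
    n_tokens = len(model_probs[0])
    averaged = []
    for t in range(n_tokens):
        combined = {}
        for probs in model_probs:
            for tag, p in probs[t].items():
                combined[tag] = combined.get(tag, 0.0) + p / n_models
        averaged.append(combined)
    return averaged

# Two hypothetical models disagreeing on a single token:
avg = ensemble_tag_probs([
    [{"$KEEP": 0.9, "$DELETE": 0.1}],
    [{"$KEEP": 0.5, "$DELETE": 0.5}],
])
# avg[0] is roughly {"$KEEP": 0.7, "$DELETE": 0.3}
```

Averaging probabilities tends to make the ensemble more conservative than any single model, which suits the precision-oriented tuning used for GEC.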

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters to specify; among them:

  • `cold_steps_count` - the number of epochs during which only the last linear layer is trained
  • `transformer_model` {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert} - the encoder model
  • `tn_prob` - the probability of sampling sentences with no errors; helps to balance precision/recall
  • `pieces_per_token` - the maximum number of subwords per token; helps avoid CUDA out-of-memory errors
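The cold-start idea behind `cold_steps_count` can be sketched as a simple schedule that decides which parameter groups are trainable at each epoch (the group names below are hypothetical; the repository's actual freezing logic lives in the training code and may differ):

```python
def trainable_groups(epoch, cold_steps_count):
    """Illustrative cold-start schedule: for the first `cold_steps_count`
    epochs only the final linear (tagging) layer is updated; afterwards
    the pretrained encoder is unfrozen as well."""
    if epoch < cold_steps_count:
        return ["tag_projection"]              # hypothetical layer name
    return ["encoder", "tag_projection"]

# With cold_steps_count=2, epochs 0-1 train only the tagging head:
assert trainable_groups(0, 2) == ["tag_projection"]
assert trainable_groups(2, 2) == ["encoder", "tag_projection"]
```

Warming up the randomly initialized head first prevents large early gradients from disturbing the pretrained encoder weights.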

In our experiments we used a 98/2 train/dev split.
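A 98/2 split can be reproduced with a simple shuffle, for example (the helper name, seed, and fraction default are illustrative, not taken from the repository):

```python
import random

def train_dev_split(lines, dev_fraction=0.02, seed=42):
    """Shuffle preprocessed lines and split off a small dev set."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    shuffled = lines[:]
    rng.shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]

train, dev = train_dev_split([f"sent {i}" for i in range(1000)])
# -> 980 train / 20 dev sentences for a 1000-line corpus
```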

Training parameters

All parameters used for training and evaluation are described here.

Model inference

To run your model on the input file use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

Among the parameters:

  • `min_error_probability` - minimum error probability (as in the paper)
  • `additional_confidence` - confidence bias (as in the paper)
  • `special_tokens_fix` - set this to reproduce some reported results of pretrained models
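As a hedged sketch of how these two thresholds interact at decoding time (a simplification of the paper's description; the exact logic in `predict.py` and the model code may differ): `additional_confidence` is added to the `$KEEP` probability, biasing the model toward leaving tokens unchanged, and edits below `min_error_probability` are discarded.

```python
def apply_thresholds(tag_probs, additional_confidence=0.0,
                     min_error_probability=0.0):
    """Pick one tag per token after biasing toward $KEEP.

    tag_probs: list over tokens of {tag: probability} dicts.
    `additional_confidence` is added to $KEEP's score; an edit tag whose
    score is below `min_error_probability` is replaced by $KEEP.
    """
    chosen = []
    for probs in tag_probs:
        biased = dict(probs)
        biased["$KEEP"] = biased.get("$KEEP", 0.0) + additional_confidence
        tag = max(biased, key=biased.get)
        if tag != "$KEEP" and biased[tag] < min_error_probability:
            tag = "$KEEP"                # suppress low-confidence edits
        chosen.append(tag)
    return chosen

# A borderline deletion is rejected once $KEEP gets a 0.2 bias:
tags = apply_thresholds([{"$KEEP": 0.45, "$DELETE": 0.55}],
                        additional_confidence=0.2)
# -> ["$KEEP"]  (0.45 + 0.2 = 0.65 beats 0.55)
```

Raising either knob trades recall for precision, which is why they are tuned on the dev set.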

For evaluation we use ERRANT.

About


License: Apache License 2.0

