This repository provides several improvements to the state-of-the-art sequence tagging model for grammatical error correction described in the following thesis:
The code in this repository is mainly based on the official implementation from the following paper:
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested using Python 3.7.
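For example, a minimal and purely illustrative setup inside a virtual environment (the environment name is arbitrary) could look like:

python3.7 -m venv gec-env
source gec-env/bin/activate
pip install -r requirements.txt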
All the public GEC datasets used in the thesis can be downloaded from here.
Knowledge distilled datasets can be downloaded here.
Synthetic datasets created with PIE can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
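For example, with hypothetical parallel files train_src.txt (original sentences) and train_tgt.txt (corrected sentences), the call could be:

python utils/preprocess_data.py -s train_src.txt -t train_tgt.txt -o train_preprocessed.txt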
All available pretrained models can be downloaded here.
Pretrained encoder | BEA-2019 (test)
--- | ---
RoBERTa [link] | 73.1
Large RoBERTa voc10k + DeBERTa voc10k + XLNet voc 5k [link] | 76.05
To train the model, simply run:
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
--model_dir MODEL_DIR
There are many parameters to specify; among them (an example invocation is shown after the list):
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert, distilbert, gpt2, roberta, transformerxl, xlnet, albert} - model encoder
- `tn_prob` - probability of getting sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - maximum number of subwords per token; helps to avoid running out of CUDA memory
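As a sketch only, assuming the command-line flags mirror the parameter names above and using placeholder file names and values, a training run could look like:

python train.py --train_set train_preprocessed.txt --dev_set dev_preprocessed.txt \
                --model_dir model_output \
                --transformer_model roberta \
                --cold_steps_count 4 \
                --tn_prob 0.2 \
                --pieces_per_token 5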
In our experiments we used a 98/2 train/dev split.
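As an illustration of such a split (file names are placeholders; any equivalent tooling works), the last 2% of lines of the preprocessed file can be held out as a dev set:

total=$(wc -l < train_preprocessed.txt)
dev_size=$((total * 2 / 100))
head -n $((total - dev_size)) train_preprocessed.txt > train_split.txt
tail -n $dev_size train_preprocessed.txt > dev_split.txt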
We describe all the parameters that we use for training and evaluation here.
To run your model on an input file, use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
Among the parameters (an example invocation is shown after the list):
- `min_error_probability` - minimum error probability (as in the paper)
- `additional_confidence` - confidence bias (as in the paper)
- `special_tokens_fix` - needed to reproduce some of the reported results of the pretrained models
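For example, assuming the flags mirror the parameter names above (file names and values are placeholders), an ensemble prediction could be run as:

python predict.py --model_path model1.th model2.th \
                  --vocab_path VOCAB_PATH \
                  --input_file input.txt --output_file predictions.txt \
                  --min_error_probability 0.5 \
                  --additional_confidence 0.2 \
                  --special_tokens_fix 1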
For evaluation we use ERRANT.
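For reference, a typical ERRANT evaluation (file names are placeholders) first aligns the original and corrected sentences into M2 format and then scores the hypothesis against a reference:

errant_parallel -orig input.txt -cor predictions.txt -out hyp.m2
errant_compare -hyp hyp.m2 -ref ref.m2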