yanghh2000 / Alirector

Source code of paper "Alirector: Alignment-Enhanced Chinese Grammatical Error Corrector" (Findings of ACL 2024)

Home Page: https://arxiv.org/abs/2402.04601

Alirector: Alignment-Enhanced Chinese Grammatical Error Corrector (Findings of ACL 2024)

Environment

To install the environment, run:

pip install -r requirements.txt

Data

Download

MuCGEC and NLPCC18: download links can be found in the MuCGEC repository.

FCGEC: FCGEC repository.

NaCGEC: NaCGEC repository.

Process

Process the data into the same format as data/MuCGEC/train_examples.json.

Use data/MuCGEC/utils.py to split the data into two parts for two-stage training.
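As a rough illustration of the split step, the sketch below shuffles a list of examples and cuts it into a stage 1 part and a stage 2 part. It is not the logic of data/MuCGEC/utils.py: the function name `split_for_two_stage`, the 50/50 ratio, the fixed seed, and the toy `id` field are all assumptions made for the example.

```python
import random


def split_for_two_stage(examples, stage1_ratio=0.5, seed=42):
    """Shuffle examples and split them into (stage1, stage2) lists.

    The ratio and seed are illustrative defaults, not values from the paper.
    """
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * stage1_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy records standing in for entries shaped like train_examples.json.
examples = [{"id": i} for i in range(10)]
stage1, stage2 = split_for_two_stage(examples)
print(len(stage1), len(stage2))
```

Keeping the two parts disjoint matters here because the stage 2 data is later paired with predictions from the model trained on stage 1.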

Download Pretrained Models

Chinese BART large: Hugging Face Link

Baichuan2-7B-Base: Hugging Face Link

Training

Initial Correction Model (Stage 1 Data)

# bart
bash seq2seq/scripts/train_stage1.sh

# baichuan2
bash llm/scripts/train_stage1.sh

Generate Prediction for Stage 2 Data

# bart
bash seq2seq/scripts/generate_stage2_pred.sh

# baichuan2
bash llm/scripts/generate_stage2_pred.sh

Alignment Model (Stage 2 Data)

# bart
bash seq2seq/scripts/train_align.sh

# baichuan2
bash llm/scripts/train_align.sh

Alignment Distillation (Stage 2 Data)

# bart
bash seq2seq/scripts/train_alignment_distill.sh

# baichuan2
bash llm/scripts/train_alignment_distall.sh
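The four steps above can be chained from a small driver. The sketch below only reuses the script paths listed in this section (copied verbatim) and assumes that each step consumes the outputs of the earlier ones, as the ordering suggests. The `PIPELINE` table, the `run_pipeline` helper, and its `dry_run` flag are hypothetical names for illustration; the scripts themselves may still need their paths and hyperparameters edited before use.

```python
import subprocess

# Script paths copied from the commands above, in training order.
PIPELINE = {
    "bart": [
        "seq2seq/scripts/train_stage1.sh",
        "seq2seq/scripts/generate_stage2_pred.sh",
        "seq2seq/scripts/train_align.sh",
        "seq2seq/scripts/train_alignment_distill.sh",
    ],
    "baichuan2": [
        "llm/scripts/train_stage1.sh",
        "llm/scripts/generate_stage2_pred.sh",
        "llm/scripts/train_align.sh",
        "llm/scripts/train_alignment_distall.sh",
    ],
}


def run_pipeline(model, dry_run=True):
    """Run (or, with dry_run=True, just print) the steps for one model."""
    commands = [["bash", script] for script in PIPELINE[model]]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands


commands = run_pipeline("bart")  # dry run: prints the four commands
```

Calling `run_pipeline("baichuan2", dry_run=False)` from the repository root would execute the LLM variant step by step.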

Predict and Evaluate

For prediction, use llm/src/predict.py or seq2seq/src/predict.py.

For evaluation, we use the ChERRANT scorer to compute character-level P/R/F0.5 on FCGEC and NaCGEC, and M2Scorer to compute word-level P/R/F0.5 on NLPCC18-Test. For usage, refer to this script.
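For reference, P/R/F0.5 are derived from edit counts as sketched below. This is only the standard F-beta formula, not the ChERRANT or M2Scorer code: the helper name `precision_recall_fbeta` is hypothetical, and the 0.0 fallback for empty denominators is a choice made for this example that may differ from the scorers' own conventions.

```python
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F_beta) from edit-level counts.

    tp: predicted edits that match a gold edit
    fp: predicted edits with no gold match
    fn: gold edits the system missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fbeta = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


p, r, f05 = precision_recall_fbeta(tp=6, fp=2, fn=4)
print(f"P={p:.4f} R={r:.4f} F0.5={f05:.4f}")  # P=0.7500 R=0.6000 F0.5=0.7143
```

With beta = 0.5, precision is weighted more heavily than recall, so a system that proposes fewer but more reliable edits scores higher.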
