gzomer / BeyondGEC

Research code for the EMNLP 2021 Findings paper "Beyond Grammatical Error Correction: Improving L1-influenced research writing in English using pre-trained encoder-decoder models"

Home Page: https://aclanthology.org/2021.findings-emnlp.216/


Beyond GEC

In this paper, we present a new method for training a writing improvement model adapted to the writer’s first language (L1) that goes beyond grammatical error correction (GEC). Without using annotated training data, we rely solely on pre-trained language models fine-tuned with parallel corpora of reference translation aligned with machine translation. We evaluate our model with corpora of academic papers written in English by L1 Portuguese and L1 Spanish scholars and a reference corpus of expert academic English. We show that our model is able to address specific L1-influenced writing and more complex linguistic phenomena than existing methods, outperforming what a state-of-the-art GEC system can achieve in this regard. Our code and data are open to other researchers.

Corpora

ExPACE: Download
BrACE: Download
LACE: Download

Parallel training data

Pt-EN-to-EN and Pt-ES-to-EN: Download

Citation

@inproceedings{zomer-frankenberg-garcia-2021-beyond-grammatical,
    title = "Beyond Grammatical Error Correction: Improving {L}1-influenced research writing in {E}nglish using pre-trained encoder-decoder models",
    author = "Zomer, Gustavo  and
      Frankenberg-Garcia, Ana",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.216",
    pages = "2534--2540",
    abstract = "In this paper, we present a new method for training a writing improvement model adapted to the writer{'}s first language (L1) that goes beyond grammatical error correction (GEC). Without using annotated training data, we rely solely on pre-trained language models fine-tuned with parallel corpora of reference translation aligned with machine translation. We evaluate our model with corpora of academic papers written in English by L1 Portuguese and L1 Spanish scholars and a reference corpus of expert academic English. We show that our model is able to address specific L1-influenced writing and more complex linguistic phenomena than existing methods, outperforming what a state-of-the-art GEC system can achieve in this regard. Our code and data are open to other researchers.",
}
