mtala3t / GEC-Info

Repository to collect and categorize Grammatical Error Correction papers.

GEC Information

Some information on grammatical error correction :)
Information will be added from time to time.

It can also be viewed on GitHub Pages

Overview

Surveys

Title Year Paper Note
"A Comprehensive Survey of Grammar Error Correction" 2020 [paper]

Shared Tasks

Name Year Paper Note
HOO 2011 2011 [paper] [website]
HOO 2012 2012 [paper] [website]
CoNLL-2013 2013 [paper] [website]
CoNLL-2014 2014 [paper] [website] [system outputs]
BEA-2019 2019 [paper] [website] [system outputs]

Datasets

For Training (Real Data)

Name Year Paper Note
EFCamDat 2014 [Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)] [The EF Cambridge Open Language Database (efcamdat) Information for Users] [download v2]
GitHub Typo Corpus 2019 [GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors] [download]
W&I+LOCNESS on BEA2019 Shared Task 2019 [Developing an Automated Writing Placement System for ESL Learners] [direct download]
FCE 2011 [A New Dataset and Method for Automatically Grading ESOL Texts] [direct download]
NUCLE 2013 [Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English] [download]
ICNALE 2013 [The ICNALE and Sophisticated Contrastive Interlanguage Analysis of Asian Learners of English] [download]
Lang-8 2011 [Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners] [website] [download: Fill this form]
Related tools are useful; see [Other Tools] for details.

For Training (Pseudo/Synthetic Data)

Name Year Paper Note
PIE-synthetic 2019 [Parallel Iterative Edit Models for Local Sequence Transduction] [download]

For Evaluation

Name Year Paper Note
KJ 2011 [Creating a manually error-tagged and shallow-parsed learner corpus] [download]
CoNLL-2013 2013 [The CoNLL-2013 Shared Task on Grammatical Error Correction] [direct download]
CoNLL-2014 2014 [The CoNLL-2014 Shared Task on Grammatical Error Correction] [direct download]
10 additional annotations for the CoNLL14 2015 [How Far are We from Fully Automatic High Quality Grammatical Error Correction?] [direct download]
8 additional annotations for the CoNLL14 2016 [Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality] [download]
JFLEG 2017 [JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction] [download]
GMEG-Data 2019 [Enabling Robust Grammatical Error Correction in New Domains: Data Sets, Metrics, and Analyses] [code]
CWEB 2020 [Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses] [download]
ErAConD 2021 [ErAConD : Error Annotated Conversational Dialog Dataset for Grammatical Error Correction] [data]
Training dataset is included.

Performance measures

Reference-based

Name Year Paper Note
M^2 Scorer 2012 [Better Evaluation for Grammatical Error Correction] [code]
It is often used to evaluate CoNLL-2013 and CoNLL-2014.
GLEU 2015 [Ground Truth for Grammatical Error Correction Metrics]
[GLEU Without Tuning]
[code]
It is often used to evaluate JFLEG.
I-measure 2015 [Towards a standard evaluation method for grammatical error detection and correction] [code]
Code is available only for Python 2.x.
ERRANT 2016 [Automatic Extraction of Learner Errors in ESL Sentences Using Linguistically Enhanced Alignments]
[Automatic Annotation and Evaluation of Error Types for Grammatical Error Correction]
[code]
It is often used to evaluate BEA-2019.
GMEG-Metric 2019 [Enabling Robust Grammatical Error Correction in New Domains: Data Sets, Metrics, and Analyses] [code]
Ridge regression using existing metrics (e.g. ERRANT, GLEU) as features.
GoToScorer 2019 [Taking the Correction Difficulty into Account in Grammatical Error Correction Evaluation] [code]
It evaluates systems while taking error correction difficulty into account.

Reference-less

Keywords / Overview Year Paper Note
Scoring by counting the errors 2016 [There’s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction] [code]
Fluency + grammaticality + meaning preservation 2017 [Reference-based Metrics can be Replaced with Reference-less Metrics in Evaluating Grammatical Error Correction Systems]
USim 2018 [Reference-less Measure of Faithfulness for Grammatical Error Correction] [code]
SOME 2020 [SOME: Reference-less Sub-Metrics Optimized for Manual Evaluations of Grammatical Error Correction] [code]
Scribendi Score 2021 [Is this the end of the gold standard? A straightforward reference-less grammatical error correction metric] [Unofficial code]

Quality Estimation

Keywords / Overview Year Paper Note
2022 Proficiency Matters Quality Estimation in Grammatical Error Correction

Models / Architectures

Supervised

Keywords / Overview Year Paper Note
Phrase-based SMT 2016 [Phrase-based Machine Translation is State-of-the-Art for Automatic Grammatical Error Correction] [code]
Neural reinforcement learning 2017 [Grammatical Error Correction with Neural Reinforcement Learning]
Word-level SMT enhanced NNJMs + char-based SMT 2017 [Connecting the Dots: Towards Human-Level Grammatical Error Correction] [code]
First NMT-based approach 2016 [Grammatical error correction using neural machine translation]
SMEG 2017 [Systematically Adapting Machine Translation for Grammatical Error Correction] [code]
A nested attention (word and char attention) 2017 [A Nested Attention Neural Hybrid Model for Grammatical Error Correction]
Re-ranking N-best sentence (by SMT) with LSTM-based GED 2017 [Neural Sequence-Labelling Models for Grammatical Error Correction]
CNN-based Encoder-Decoder approach 2018 [A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction] [code]
Fluency boosting learning 2018 [Fluency Boost Learning and Inference for Neural Grammatical Error Correction] [code]
ACL2018
Fluency boosting learning (added round-way error correction) 2018 [Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study] [code]
Microsoft Research Technical Report
Hybrid SMT and NMT 2018 [Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation]
Copy-Augmented Architecture 2019 [Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data] [code]
Consider a few previous sentences 2019 [Cross-Sentence Grammatical Error Correction] [code]
PIE 2019 [Parallel Iterative Edit Models for Local Sequence Transduction] [code]
LaserTagger 2019 [Encode, Tag, Realize: High-Precision Text Editing] [code]
Pretrain by DAE + sequential transfer learning 2019 [A Neural Grammatical Error Correction System Built On Better Pre-training and Sequential Transfer Learning] [code]
BEA-2019: Kakao&Brain
Use sentence-level error detection 2019 [The AIP-Tohoku System at the BEA-2019 Shared Task] BEA-2019: AIP-Tohoku
Four CNN + eight Transformer 2019 [The LAIX Systems in the BEA-2019 GEC Shared Task] BEA-2019: LAIX
Combine Transformer+CNN with FST + Re-ranking 2019 [Neural and FST-based approaches to grammatical error correction] BEA-2019: CAMB-CLED
Transformer seq2seq + BERT re-ranker 2019 [TMU Transformer System Using BERT for Re-ranking at BEA 2019 Grammatical Error Correction on Restricted Track] BEA-2019: TMU
Apply noisy channel with BERT and GPT-2 as LM 2019 [Noisy Channel for Low Resource Grammatical Error Correction] BEA-2019: Siteimprove
Use Finite State Transducers 2019 [Neural Grammatical Error Correction with Finite State Transducers]
GECToR 2020 [GECToR – Grammatical Error Correction: Tag, Not Rewrite] [code]
BERT-fuse 2020 [Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction] [code]
Adversarial approach (G:seq2seq D:sentence-pair classification) 2020 [Adversarial Grammatical Error Correction]
Erroneous span correction and detection 2020 [Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction]
Document-level approach 2020 [Document-level grammatical error correction] [code]
Seq2Edits 2020 [Seq2Edits: Sequence Transduction Using Span-level Edit Operations] [code]
Beam search considering copy probability 2020 [Generating Diverse Corrections with Local Beam Search for Grammatical Error Correction]
BART-based 2020 [Stronger Baselines for Grammatical Error Correction Using a Pretrained Encoder-Decoder Model] [code]
VERNet 2021 [Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction] [code]
Shallow Aggressive Decoding 2021 [Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding] [code]
T5-based 2021 [A Simple Recipe for Multilingual Grammatical Error Correction] [code]
GAN-like sequence labeling 2021 [Grammatical Error Correction as GAN-like Sequence Labeling]
Diversity-Driven Combination (DDC) 2021 [Diversity-Driven Combination for Grammatical Error Correction] [code]
Select a system for each error type with IP 2021 [System Combination for Grammatical Error Correction Based on Integer Programming] [code]
Use multiclass GED for Transformer seq2seq and reranking 2021 [Multi-Class Grammatical Error Detection for Correction: A Tale of Two Systems]
GEC for writing improvement model adapted to the writer’s L1 2021 [Beyond Grammatical Error Correction: Improving L1-influenced research writing in English using pre-trained encoder-decoder models] [code]
Contrastive Learning approach 2021 [Grammatical Error Correction with Contrastive Learning in Low Error Density Domains] [code]
Sequence Span Rewriting 2021 [Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting]
Dependent Self-Attention (DSA) 2021 [Grammatical Error Correction with Dependency Distance]
2022 Interpretability for Language Learners Using Example-Based Grammatical Error Correction [code]
2022 Type-Driven Multi-Turn Corrections for Grammatical Error Correction [code]
GECToR Large 2022 Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction [code] [Author's Master Thesis]

Unsupervised

Keywords / Overview Year Paper Note
5-gram LM based approach 2018 [Language Model Based Grammatical Error Correction without Annotated Training Data] [code]
Train GRU models for each of five error types 2018 [A Simple but Effective Classification Model for Grammatical Error Correction]
Use Finite State Transducers 2019 [Neural Grammatical Error Correction with Finite State Transducers]
LSTM tagger for word choice task 2019 [Choosing the Right Word: Using Bidirectional LSTM Tagger for Writing Support Systems] [code]
Use LM (BERT, GPT-1,2) 2019 [The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction]
Create erroneous data from monolingual data 2019 [Minimally-Augmented Grammatical Error Correction] Supervised setting is also performed
LM-Critic 2021 [LM-Critic: Language Models for Unsupervised Grammatical Error Correction] [code]
Supervised setting is also performed

Strategies

Keywords / Overview Year Paper Note
Some methods that can be adapted from neural MT 2018 [Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task] [code]
Iterative decoding 2018 [Weakly Supervised Grammatical Error Correction using Iterative Decoding]
Combine systems automatically 2019 [Learning to combine Grammatical Error Corrections] [code]
Add adversarial examples continually 2020 [Improving Grammatical Error Correction Models with Purpose-Built Adversarial Examples]
Cross-lingual Transfer Learning 2020 [Cross-lingual Transfer Learning for Grammatical Error Correction]
Data Weighted Training Strategies 2020 [Data Weighted Training Strategies for Grammatical Error Correction]

Data Augmentation

Keywords / Overview Year Paper Note
Make artificial errors in a probabilistic manner 2014 [Generating artificial errors for grammatical error correction]
Back translation 2016 [Improving Neural Machine Translation Models with Monolingual Data]
SMT based MT + pattern extraction 2017 [Artificial Error Generation with Machine Translation and Syntactic Patterns]
Diverse back translation with noisy beam search 2018 [Noising and Denoising Natural Language: Diverse Backtranslation for Grammar Correction]
DirectNoise 2019 [Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data] The method was first called "DirectNoise" by [kiyono+ 2019]?
Substituting words using confusion sets 2019 [Neural Grammatical Error Correction Systems with Unsupervised Pre-training on Synthetic Data] [synthetic data]
BEA-2019: UEDIN-MS
Error+Context Dictionary 2019 [Improving Precision of Grammatical Error Correction with a Cheat Sheet] BEA-2019: Buffalo
Use Google Translate for making pseudo data 2019 [(Almost) Unsupervised Grammatical Error Correction using a Synthetic Comparable Corpus] BEA-2019: TMU in Low Resource
Inverted Spellchecker + Patterns+POS 2019 [A Comparative Study of Synthetic Data Generation Methods for Grammatical Error Correction]
Methods for erroneous data generation 2019 [Erroneous data generation for Grammatical Error Correction] BEA-2019: Shuyao
Wikipedia revision & Wikipedia round-trip translation 2019 [Corpora Generation for Grammatical Error Correction]
Create confusion sets by edit distance, word embeddings, spell-breaking 2019 [Minimally-Augmented Grammatical Error Correction] Supervised setting is also performed
Explore methods to make pseudo data, seed corpus, training settings 2019 [An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction] [code]
2020 [Massive Exploration of Pseudo Data for Grammatical Error Correction]
Control error rates and error types by rule-based corruption and filtered back-translation 2020 [Controllable Data Synthesis Method for Grammatical Error Correction]
Use machine translation pairs 2020 [Improving Grammatical Error Correction with Machine Translation Pairs]
Edit latent representation 2020 [Improving Grammatical Error Correction with Data Augmentation by Editing Latent Representation]
Consider learner’s error tendency 2020 [Grammatical Error Correction Using Pseudo Learner Corpus Considering Learner’s Error Tendency]
Tagged corruption 2021 [Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models] [code]
Use 188 modules 2021 [Various Errors Improve Neural Grammatical Error Correction] [code]
Use real error patterns and linguistic knowledge 2021 [Data Augmentation of Incorporating Real Error Patterns and Linguistic Knowledge for Grammatical Error Correction]
Divide non-English sentence into chunks → translate to English for each of them → concatenate 2021 [Grammatical Error Generation Based on Translated Fragments]

Data Cleaning

Keywords / Overview Year Paper Note
A Self-Refinement Strategy for Noise Reduction 2020 [A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction]
cLang8 (Cleaned Lang-8) 2021 [A Simple Recipe for Multilingual Grammatical Error Correction] [code]

Analyses / Findings

Keywords / Overview Year Paper Note
Re-rank the CoNLL14 systems by human evaluation 2015 [Human Evaluation of Grammatical Error Correction Systems] [code]
2015 [How Far are We from Fully Automatic High Quality Grammatical Error Correction?]
Human annotation focused on fluency 2016 [Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality] [code]
2017 [GEC into the future: Where are we going and how do we get there?]
MEAGE 2018 [Automatic Metric Validation for Grammatical Error Correction] [code]
2018 [Inherent Biases in Reference-based Evaluation for Grammatical Error Correction] [code]
2018 [Assessing Grammatical Correctness in Language Learning]
Reassess M^2, I-measure, GLEU by comparing with human evaluation 2018 [A Reassessment of Reference-Based Grammatical Error Correction Metrics] [code]
Quality estimation (and re-ranking using estimated score) 2018 [Neural Quality Estimation of Grammatical Error Correction] [code]
Evaluate four systems (SMT, CNN, LSTM, Transformer) for six corpora (CoNLL13&14, FCE, JFLEG, KJ, ICNALE) 2019 [Cross-Corpora Evaluation and Analysis of Grammatical Error Correction Models — Is Single-Corpus Evaluation Enough?]
Compare CNN, Transformer, PRPN, ON-LSTM as back-translation models 2019 [The Unbearable Weight of Generating Artificial Errors for Grammatical Error Correction]
GEC for post-processing 2021 Automatic Grammatical Error Correction for Sequence-to-sequence Text Generation: An Empirical Study
CGOP 2020 [Comparison of the Evaluation Metrics for Neural Grammatical Error Correction With Overcorrection] Metric considering overcorrection
Create new gold data by post-editing system outputs 2021 [How Good (really) are Grammatical Error Correction Systems?]
Explore whether models have grammatical knowledge with Known-setting and Unknown-setting 2021 [Do Grammatical Error Correction Models Realize Grammatical Generalization?]
Compare CNN, LSTM, transformer or combinations of them as BT models 2021 [Comparison of Grammatical Error Correction Using Back-Translation Models]
2022 Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models

Spoken Domain

Keywords / Overview Year Paper Note
Detection 2019 Automatic Grammatical Error Detection of Non-native Spoken Learner English
Detection 2020 Grammatical error detection in transcriptions of spoken English
Correction, disfluency detection model 2020 Spoken Language ‘Grammatical Error Correction’

Applications

Name Year Paper Note
GECko+ [GECko+: a Grammatical and Discourse Error Correction Tool] [website] [code]
An English writing assistance tool. It corrects grammatical errors and re-orders sentences automatically.
MiSS 2021 [MiSS: An Assistant for Multi-Style Simultaneous Translation] [website] [demo video]

Projects

Name Website
GramFormer [GitHub]

Other Tools

Name Code Note
Lang8-NAIST-extractor [code] Scripts for extracting error-correct pairs from the Lang-8 Corpus.
M2Converter [code] Scripts for converting m2 file into source file and target file.
EFCamDat-Preprocess [code]
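
To illustrate the kind of conversion M2Converter performs, here is a minimal Python sketch that turns M2-formatted text into (source, target) sentence pairs. It is not the code from that repository: the function name `m2_to_parallel` and the sample data are hypothetical, and the sketch assumes the usual M2 layout (an `S` line with the tokenized source, followed by `A start end|||type|||correction|||...|||annotator_id` lines sorted by position). It skips `noop` edits and treats a `-NONE-` correction as empty.

```python
def m2_to_parallel(m2_text, annotator=0):
    """Return a list of (source, target) pairs built from M2-formatted text.

    Only edits made by the given annotator id are applied (assumption:
    the annotator id is the last '|||'-separated field of each 'A' line).
    """
    pairs = []
    for block in m2_text.strip().split("\n\n"):
        lines = block.strip().split("\n")
        if not lines or not lines[0].startswith("S "):
            continue
        source_tokens = lines[0][2:].split()
        target_tokens = list(source_tokens)
        offset = 0  # shift in token positions caused by earlier edits
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            error_type, correction = fields[1], fields[2]
            annotator_id = int(fields[-1])
            # Skip other annotators and "no edit" markers (span -1 -1).
            if annotator_id != annotator or error_type == "noop" or start < 0:
                continue
            replacement = [] if correction == "-NONE-" else correction.split()
            target_tokens[start + offset:end + offset] = replacement
            offset += len(replacement) - (end - start)
        pairs.append((" ".join(source_tokens), " ".join(target_tokens)))
    return pairs


# Hypothetical sample in M2 format (not taken from any real corpus).
SAMPLE_M2 = (
    "S This are a sentence .\n"
    "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0\n"
    "A 3 3|||M:ADJ|||good|||REQUIRED|||-NONE-|||0\n"
    "\n"
    "S Fine .\n"
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0\n"
)

for src, tgt in m2_to_parallel(SAMPLE_M2):
    print(src, "->", tgt)
```

Applying the edits left to right with a running offset keeps later token spans aligned after insertions and deletions. A real converter would also need to decide how to handle detection-only edit types and multiple annotators.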

Other materials

Name Paper Note
NLP-progress [website]
The performance ranking on some datasets.
A Crash Course in Automatic Grammatical Error Correction [paper] [materials]
A tutorial on GEC at COLING 2020.
Chunngai/gec-papers [github]
The papers were compiled around 2019-2020?

Related Tasks

Grammatical Error Detection

Keywords / Overview Year Paper Note
A weighted measure according to crowdsourcing results (for GED) 2011 [They Can Help: Using Crowdsourcing to Improve the Evaluation of Grammatical Error Detection Systems]
2018 [Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection] [code]
Bi-LSTM with contextual word embeddings 2019 [Context is Key: Grammatical Error Detection with Contextual Word Representations]
Multi-head and multi-layer attention 2019 [Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection]
2021 [Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors]

Feedback Comment Generation

Keywords / Overview Year Paper Note
2014 [Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages]
English grammar checker with feedback in Japanese 2018 [Grammatical Error Checker for Japanese Learners of English] This is not research on feedback comment generation, but I classify it here for now.
2019 [Toward a Task of Feedback Comment Generation for Writing Learning]
2020 [Creating Corpora for Research in Feedback Comment Generation]
2021 [Shared Task on Feedback Comment Generation for Language Learners]

Other Languages

Arabic

Keywords / Overview Year Paper Note
Arabic Learner Corpus 2013 [Arabic Learner Corpus v1: A New Resource for Arabic Language Research] [website]
QALB 2014 [Large Scale Arabic Error Annotation: Guidelines and Framework] [QALB Project Website]
QALB 2014 Shared Task 2014 [The First QALB Shared Task on Automatic Text Correction for Arabic] [website]
QALB 2015 Shared Task 2015 [The Second QALB Shared Task on Automatic Text Correction for Arabic]
ARETA 2021 [Automatic Error Type Annotation for Arabic] [code]

Bangla

Keywords / Overview Year Paper Note
2021 [Development of Bangla Spell and Grammar Checkers: Resource Creation and Evaluation]

Chinese

Keywords / Overview Year Paper Note
NLPCC-2018 Shared Task 2018 [Overview of the NLPCC 2018 Shared Task: Grammatical Error Correction] [data]
Two-stage: Spell checker → seq2seq 2019 [A Two-Stage Model for Chinese Grammatical Error Correction]
CNN-based seq2seq 2019 [Chinese Grammatical Error Correction Based on Convolutional Sequence to Sequence Model]
MaskGEC 2020 [MaskGEC: Improving Neural Grammatical Error Correction via Dynamic Masking]
2020 [Chinese Grammatical Error Detection Based on BERT Model]
2020 [BERT Enhanced Neural Machine Translation and Sequence Tagging Model for Chinese Grammatical Error Diagnosis]
2020 [Heterogeneous Recycle Generation for Chinese Grammatical Error Correction]
NLPTEA-2020 Shared Task 2020 [Overview of NLPTEA-2020 Shared Task for Chinese Grammatical Error Diagnosis]
Tail-to-Tail Non-Autoregressive Sequence Prediction 2021 [Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction]
2021 "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction

Czech

Keywords / Overview Year Paper Note
AKCES-GEC dataset 2019 [Grammatical Error Correction in Low-Resource Scenarios] [data]
Grammar Error Correction Corpus for Czech (GECCC) 2022 Czech Grammar Error Correction with a Large and Diverse Corpus [data]

Greek

Keywords / Overview Year Paper Note
Greek Learner Corpus 2018 [Stand-off annotation in learner corpora: compiling the Greek Learner Corpus (GLC)]
ELERRANT 2021 [ELERRANT: Automatic Grammatical Error Type Classification for Greek] [code]

German

Keywords / Overview Year Paper Note
Falko-MERLIN dataset 2018 [Using Wikipedia Edits in Low Resource Grammatical Error Correction] [data]

Hindi

Keywords / Overview Year Paper Note
2014 [Detection and correction of non word spelling errors in Hindi language]
HiWikiEd dataset 2020 [Generating Inflectional Errors for Grammatical Error Correction in Hindi] [data]

Japanese

Keywords / Overview Year Paper Note
Character-level RNN-based seq2seq 2018 [Automatic Error Correction on Japanese Functional Expressions Using Character-based Neural Machine Translation]
Constructing retrieval system for Japanese GEC 2019 [Grammatical-Error-Aware Incorrect Example Retrieval System for Learners of Japanese as a Second Language]
TMU Evaluation Corpus for Japanese Learners 2020 [Construction of an Evaluation Corpus for Grammatical Error Correction for Learners of Japanese as a Second Language] [data: Fill this form]
Non-Autoregressive approach 2020 [Non-Autoregressive Grammatical Error Correction Toward a Writing Support System]
2022 Construction of a Quality Estimation Dataset for Automatic Evaluation of Japanese Grammatical Error Correction

Lithuanian

Keywords / Overview Year Paper Note
2022 Towards Lithuanian grammatical error correction [code]

Romanian

Keywords / Overview Year Paper Note
2020 [Neural Grammatical Error Correction for Romanian] [code]

Russian

Keywords / Overview Year Paper Note
RULEC-GEC dataset 2019 [Grammar Error Correction in Morphologically Rich Languages: The Case of Russian] [data]
RU-Lang8 dataset 2021 [New Dataset and Strong Baselines for the Grammatical Error Correction of Russian] [data]

Spanish

Keywords / Overview Year Paper Note
COWS-L2H 2020 [Developing NLP Tools with a New Corpus of Learner Spanish] [data]

Ukrainian

Keywords / Overview Year Paper Note
UA-GEC 2021 [UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language] [data]
