CWEB (Corrected Websites) corpus

CWEB is an evaluation dataset for grammatical error correction (GEC) consisting of website text written by English speakers of varying levels of proficiency. It contains 13,574 sentences from 1,078 websites, all of which have been annotated for grammatical errors.

A description of this corpus can be found in the paper:

Grammatical Error Correction in Low Error Density Domains: A New Benchmark and Analyses
Simon Flachs, Ophélie Lacroix, Helen Yannakoudakis, Marek Rei and Anders Søgaard. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).

Please cite this paper when using the dataset.

Data

data/raw contains the untokenized parallel data
data/tokenized contains the tokenized parallel data (tokenized with spaCy 1.9)
data/m2 contains M2 files created with ERRANT, both with all annotators combined and for each annotator individually
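
For readers unfamiliar with the M2 format, the following is a minimal sketch of how such a file could be read in Python. It assumes the usual ERRANT layout: an `S` line holding the tokenized source sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The sample text is made up for illustration and is not taken from the corpus, and the names `Edit`, `Sentence`, `parse_m2` and `apply_edits` are hypothetical helpers, not part of ERRANT or of this repository.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Edit:
    start: int        # token offset where the edit begins
    end: int          # token offset where the edit ends (exclusive)
    error_type: str   # ERRANT error type, e.g. "R:VERB:SVA", or "noop"
    correction: str   # replacement text (empty string for a deletion)
    annotator: int    # annotator id from the last field


@dataclass
class Sentence:
    tokens: List[str]
    edits: List[Edit] = field(default_factory=list)


def parse_m2(lines: Iterable[str]) -> List[Sentence]:
    """Group 'S' and 'A' lines into Sentence objects."""
    sentences: List[Sentence] = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split(" "))
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence: Sentence, annotator: int = 0) -> str:
    """Return the corrected sentence according to one annotator."""
    tokens = list(sentence.tokens)
    edits = [
        e for e in sentence.edits
        if e.annotator == annotator and e.error_type != "noop"
    ]
    # Apply right to left so earlier token offsets stay valid.
    for e in sorted(edits, key=lambda e: (e.start, e.end), reverse=True):
        tokens[e.start:e.end] = e.correction.split()
    return " ".join(tokens)


# Made-up sample in M2 layout (not taken from CWEB).
SAMPLE = """\
S This are a test .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0

S Fine sentence .
A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0
"""

if __name__ == "__main__":
    for sent in parse_m2(SAMPLE.splitlines()):
        print(apply_edits(sent))
```

To use this on the corpus, pass an open file handle from data/m2 to `parse_m2` instead of `SAMPLE.splitlines()`. In the combined files, the annotator id in the last field distinguishes the edits of each annotator.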

Questions

Please e-mail Simon Flachs (flachs[at]di.ku.dk).

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
