grammarly / gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)

Questions about processing synthetic data

liangnn17 opened this issue

Hi,

I noticed that the tokenization in the PIE data differs from that of the NUCLE and FCE data you used. Do I need to detokenize the PIE data and re-tokenize it myself with spaCy?

Looking forward to your advice!

No, you don't need to tokenize it yourself. You can use the preprocessing script they provided to get the data into a format compatible with GECToR.

Hi @liangnn17
The tokenization for PIE may indeed differ slightly from the one used in the BEA data, but I don't think it would affect quality significantly.
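
If you want a rough sense of how much the two tokenizations disagree on your own copy of the data, a quick check like the one below may help. It is only a sketch: `simple_tokenize` is a hypothetical regex stand-in, not spaCy and not anything from the GECToR or PIE code, and `tokenization_mismatch_rate` is likewise a made-up helper name. You would swap in your real tokenizer to get a meaningful number.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer (NOT spaCy): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenization_mismatch_rate(lines):
    """Fraction of non-empty lines whose whitespace tokens change when re-tokenized."""
    total = 0
    changed = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        total += 1
        retokenized = []
        for tok in tokens:
            # Re-tokenize each whitespace-delimited token on its own.
            retokenized.extend(simple_tokenize(tok))
        if retokenized != tokens:
            changed += 1
    return changed / total if total else 0.0


# Toy input; in practice, pass the lines of the PIE source or target file.
sample = [
    "This is a tokenized sentence .",
    "This one isn't, really.",
]
rate = tokenization_mismatch_rate(sample)
print(f"{rate:.0%} of lines would change under re-tokenization")
```

A low rate would be consistent with the view that the difference is unlikely to matter much; a high rate might be a reason to re-tokenize before preprocessing.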