Questions about processing synthetic data
liangnn17 opened this issue · comments
Hi,
I noticed that the tokenization of the PIE data differs from that of the NUCLE and FCE data you used. Do I need to detokenize the PIE data and retokenize it with spaCy myself?
Looking forward to your advice!
No, you don't need to tokenize it yourself. You can use the preprocessing script they provided to get the data into a format compatible with GECToR.
Hi @liangnn17
The tokenization for PIE may indeed differ slightly from the one used for the BEA data, but I don't think it would affect quality significantly.
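If you want to check how large the difference is on your own data, here is a minimal sketch, not part of either repo. It assumes you have a sample of PIE sentences in their original tokenization and the same sentences retokenized with the tokenizer you are considering, both as whitespace-separated strings. The function name `tokenization_diff_rate` is hypothetical.

```python
def tokenization_diff_rate(tokenized_a, tokenized_b):
    """Fraction of comparable sentence pairs whose token boundaries differ.

    Both arguments are lists of whitespace-tokenized sentences in the same
    order. A pair is comparable only if its text is identical once all
    whitespace is removed; other pairs are skipped.
    """
    comparable = 0
    differing = 0
    for sent_a, sent_b in zip(tokenized_a, tokenized_b):
        tokens_a, tokens_b = sent_a.split(), sent_b.split()
        # Skip pairs whose underlying characters differ (not the same sentence).
        if "".join(tokens_a) != "".join(tokens_b):
            continue
        comparable += 1
        if tokens_a != tokens_b:
            differing += 1
    return differing / comparable if comparable else 0.0


# Made-up sentences for illustration only.
original = ["I do n't know .", "Hello world ."]
retokenized = ["I don 't know .", "Hello world ."]
rate = tokenization_diff_rate(original, retokenized)
```

A low rate on a reasonably sized sample would suggest that retokenizing is not worth the effort.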