Questions about processing synthetic data

Question

liangnn17 opened this issue 3 years ago

Hi,

I noticed that the tokenization of the PIE data differs from that of the NUCLE and FCE data you used. Do I need to detokenize the PIE data and re-tokenize it with spaCy myself?

Looking forward to your advice!

Mina Ashraf · Answer 1 · Fri Apr 22 2022 16:04:37 GMT+0800 (China Standard Time)

No, you don't need to tokenize it yourself. You can use the preprocessing script they provide to convert the data into a format compatible with GECToR.

Alex Skurzhanskyi · Answer 2 · Tue Apr 26 2022 17:18:25 GMT+0800 (China Standard Time)

Hi @liangnn17
The tokenization for PIE may indeed differ slightly from the one used for the BEA data, but I don't think it would significantly affect quality.
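
If you want to check how much the two tokenizations actually disagree on your own files, here is a rough sketch (not part of GECToR). It detokenizes each whitespace-tokenized line, re-tokenizes it, and reports the fraction of lines whose tokens change. The names `regex_tokenize`, `detokenize`, and `mismatch_rate` are hypothetical helpers, and the regex tokenizer is only a stand-in so the snippet runs without extra dependencies; in practice you would pass in whatever tokenizer you want to compare against, such as a spaCy-based one.

```python
import re
from typing import Callable, Iterable, List


def regex_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (an assumption, NOT spaCy): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def detokenize(tokens: List[str]) -> str:
    """Naive detokenizer: join on spaces, then attach common punctuation to the previous token."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


def mismatch_rate(lines: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of non-empty lines whose tokens change after detokenize + re-tokenize."""
    total = 0
    differing = 0
    for line in lines:
        original = line.split()
        if not original:
            continue  # skip blank lines
        total += 1
        if tokenize(detokenize(original)) != original:
            differing += 1
    return differing / total if total else 0.0


# Illustrative sentences, not taken from the PIE data.
sample = [
    "I do n't like it .",
    "She goes to school .",
]
rate = mismatch_rate(sample, regex_tokenize)
print(f"lines with differing tokenization: {rate:.0%}")
```

A low rate on a sample of your data would support the point above that the difference is unlikely to matter much; a high rate might be a reason to re-tokenize before preprocessing.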