SemEval mismatch between original xml and processed files

Question

HieuPhan33 opened this issue 3 years ago · comments

Hi Song,
Thanks for your excellent work.

I noticed that some sentences in the original SemEval XML files (Laptops_Test_Gold.xml, Restaurants_Test_Gold.) are missing from the processed txt files (e.g. Laptops_Test_Gold.xml.seg).

The statistics do not match. Laptops_Test_Gold.xml has 654 aspect terms, while Laptops_Test_Gold.xml.seg has only 638. Similarly, Restaurant_Test_Gold.xml and Restaurant_Test_Gold.xml.seg have 1134 and 1120 terms, respectively. That is a difference of 16 aspect terms for laptops and 14 for restaurants between the xml and txt files.

For example, the sentence with id 463:26: "So noise is reduced at least 50% and the heat is much better, now it doesn't feel hot but warm" in Laptops_Test_Gold.xml does not appear in the .seg txt file.

If you know how the processed txt files were generated, could you please explain why they differ from the XML files in the original SemEval dataset?

Thank you.

Hieu Phan · Answer 1 · Sat Sep 11 2021 19:45:27 GMT+0800 (China Standard Time)

Oh, I realize now that you simply exclude the aspect terms with conflict polarity. No worries.
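
For anyone who wants to check the counts themselves, here is a minimal sketch that tallies aspect terms by polarity. It assumes the XML uses `<aspectTerm>` elements with a `polarity` attribute, as in the SemEval-2014 files; the function names and the tiny `SAMPLE` document are made up for illustration and are not taken from the repository.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_aspect_terms(xml_source):
    """Count <aspectTerm> elements per polarity label.

    xml_source may be a file path or a binary file object.
    Assumes each <aspectTerm> carries a `polarity` attribute.
    """
    root = ET.parse(xml_source).getroot()
    return Counter(term.get("polarity") for term in root.iter("aspectTerm"))


def summarize(counts):
    """Return (total terms, conflict terms, terms left after dropping conflict)."""
    total = sum(counts.values())
    conflict = counts.get("conflict", 0)
    return total, conflict, total - conflict


# Toy document for illustration only (not real SemEval data).
SAMPLE = """<sentences>
  <sentence id="s1">
    <text>Great screen but the battery is so-so.</text>
    <aspectTerms>
      <aspectTerm term="screen" polarity="positive" from="6" to="12"/>
      <aspectTerm term="battery" polarity="conflict" from="21" to="28"/>
    </aspectTerms>
  </sentence>
  <sentence id="s2">
    <text>The keyboard feels cheap.</text>
    <aspectTerms>
      <aspectTerm term="keyboard" polarity="negative" from="4" to="12"/>
    </aspectTerms>
  </sentence>
</sentences>"""

sample_counts = count_aspect_terms(io.BytesIO(SAMPLE.encode("utf-8")))
sample_summary = summarize(sample_counts)
print(sample_summary)
```

Running `summarize(count_aspect_terms("Laptops_Test_Gold.xml"))` on the real file should show whether the number of conflict terms accounts for the gap between the xml and .seg counts reported above.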