*.prp files contain ^M artifacts which break model.setup()
ypuzikov opened this issue · comments
Observed on LDC2014T12 data instances:
- train_380
- train_961
- train_995
- train_1442
After preprocessing, there is a *.prp file containing the annotations produced by the Stanford CoreNLP tool. In all of the cases above, there is a ^M in the middle of the CoreNLP output, like so:
[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...
Not sure why this happens -- maybe CoreNLP does not process multi-sentence instances correctly? In any case, reporting for those who might wonder what is going on.
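For anyone who wants to check whether their own *.prp files are affected, here is a minimal sketch that reports which lines contain a carriage return (the byte an editor displays as ^M). The function name find_cr_lines is made up for illustration and is not part of this repository.

```python
from typing import List


def find_cr_lines(data: bytes) -> List[int]:
    """Return the 1-based numbers of the lines that contain a carriage return."""
    # Split on LF only, so a CR stays attached to its line and can be detected.
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\r" in line
    ]


# Hypothetical usage, with a placeholder path:
# from pathlib import Path
# print(find_cr_lines(Path("your_file.prp").read_bytes()))
```

The data is read as bytes on purpose: opening the file in text mode with default settings would translate line endings and hide the carriage returns.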
I solved it by manually deleting the dangling ^M from the *.prp file.
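The manual fix could also be scripted. The following is only a sketch, assuming it is safe to drop every carriage return in the file; the names strip_carriage_returns and clean_prp_file are hypothetical and not part of this repository. Note that it removes only the ^M characters themselves: a line that consisted of a lone ^M becomes an empty line, and I have not verified whether model.setup() accepts that.

```python
from pathlib import Path


def strip_carriage_returns(data: bytes) -> bytes:
    """Remove all carriage returns, turning CRLF line endings into LF."""
    # Convert CRLF to LF first, then drop any remaining lone CR bytes.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"")


def clean_prp_file(path: str) -> int:
    """Rewrite the file in place without carriage returns.

    Returns the number of carriage returns that were removed.
    """
    file_path = Path(path)
    original = file_path.read_bytes()
    cleaned = strip_carriage_returns(original)
    # Only touch the file when something actually changed.
    if cleaned != original:
        file_path.write_bytes(cleaned)
    return original.count(b"\r")
```

Working on raw bytes avoids any encoding guesswork and keeps the rest of the CoreNLP output untouched. It is worth keeping a backup of the original *.prp file before running something like this.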
That is a carriage-return character, which commonly comes from Windows-style line endings. The wrapper for Stanford CoreNLP inserted these when writing the processed file. However, corenlp.py should still process the file correctly.