*.prp files contain ^M artifacts which break model.setup()

Question

ypuzikov opened this issue 6 years ago

Observed on LDC2014T12 data instances:

train_380
train_961
train_995
train_1442

After preprocessing, there is a *.prp file that contains the annotations produced by the Stanford CoreNLP tool. In all of the cases above, a ^M appears in the middle of the CoreNLP output, like so:

[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...

Not sure why this happens -- maybe CoreNLP does not process multi-sentence instances correctly? In any case, reporting for those who might wonder what is going on.

I solved it by manually deleting the dangling ^M part from the *.prp file.
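For anyone who would rather not edit the files by hand, here is a minimal sketch of the same cleanup in Python. The function names are hypothetical and not part of the parser. It also assumes the files are UTF-8 and that a line holding only ^M should be dropped entirely rather than kept as a blank line; if the loader relies on blank lines as separators, pass drop_cr_only_lines=False.

```python
def strip_carriage_returns(text: str, drop_cr_only_lines: bool = True) -> str:
    # Hypothetical helper: removes every carriage return (shown as ^M).
    # Assumption: a line that holds nothing but a carriage return is the
    # "dangling" line from the report and can be dropped.
    cleaned = []
    for line in text.split("\n"):
        is_cr_only = "\r" in line and not line.strip()
        if drop_cr_only_lines and is_cr_only:
            continue
        cleaned.append(line.replace("\r", ""))
    return "\n".join(cleaned)


def clean_prp_file(path: str) -> None:
    # newline="" stops Python from translating line endings on read and
    # write, so the carriage returns stay visible to the helper above.
    # Assumption: the file is UTF-8 encoded.
    with open(path, "r", encoding="utf-8", newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(strip_carriage_returns(text))
```

Note that clean_prp_file overwrites the file in place, so keep a backup of the original *.prp file before running it.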

Chuan · Answer 1 · Mon Feb 19 2018 04:19:49 GMT+0800 (China Standard Time)

That is a carriage-return character (\r, displayed as ^M), which typically comes from Windows-style line endings. The wrapper for Stanford CoreNLP inserts these when writing the processed file. However, corenlp.py should still process them correctly.
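To check whether a given *.prp file actually contains carriage returns (byte 0x0D), one option is to read it in binary mode and list the affected lines. This is only a sketch; the function names are hypothetical and not part of corenlp.py.

```python
from pathlib import Path
from typing import List


def lines_with_carriage_return(data: bytes) -> List[int]:
    # Hypothetical helper: returns the 1-based numbers of the lines that
    # contain a carriage return (0x0D), which editors display as ^M.
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\r" in line
    ]


def report_carriage_returns(path: str) -> List[int]:
    # Reading bytes avoids Python's newline translation, which would
    # otherwise hide the carriage returns.
    return lines_with_carriage_return(Path(path).read_bytes())
```

An empty list means the file has no carriage returns, so any failure in model.setup() would have to come from something else.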