c-amr / camr

Transition-based tree-to-graph AMR Parser

*.prp files contain ^M artifacts which break model.setup()

ypuzikov opened this issue

Observed on LDC2014T12 data instances:

  • train_380
  • train_961
  • train_995
  • train_1442

After preprocessing, there is a *.prp file that contains the annotations produced by the Stanford CoreNLP tool. In all the cases above, I noticed a ^M in the middle of the CoreNLP output, like so:

[Text=currently CharacterOffsetBegin=0 ... ]^M
[Text=america CharacterOffsetBegin=10 ... ]^M
^M
[Text=is CharacterOffsetBegin=18 ... ]^M
...

Not sure why this happens -- maybe CoreNLP does not process multi-sentence instances correctly? In any case, reporting for those who might wonder what is going on.

I solved it by manually deleting the dangling ^M part from the *.prp file.
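For anyone who would rather not edit the file by hand, here is a minimal sketch of the same cleanup in Python. It assumes the stray ^M characters are plain carriage-return bytes (`\r`) and that removing all of them is safe for your file. The function name `strip_carriage_returns` and the example path are made up for illustration; neither is part of camr.

```python
from pathlib import Path


def strip_carriage_returns(path):
    """Remove every carriage-return byte from the file at `path`.

    Returns the number of bytes removed, so 0 means the file was already clean.
    Assumption: the file never needs a literal CR, which should hold for
    CoreNLP text output but is not confirmed in this thread.
    """
    p = Path(path)
    data = p.read_bytes()
    cleaned = data.replace(b"\r", b"")
    removed = len(data) - len(cleaned)
    if removed:
        # Rewrite only when something actually changed.
        p.write_bytes(cleaned)
    return removed


# Hypothetical usage; substitute the path to your own *.prp file:
# strip_carriage_returns("path/to/your_file.prp")
```

Note that a line holding only ^M becomes an empty line after this. Whether that leftover blank line matters to model.setup() is not established here, so keep a backup of the original file.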

commented

That is a carriage-return character, commonly produced on Windows. The wrapper for Stanford CoreNLP inserted these when writing the processed file. However, corenlp.py should still process it correctly.
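As an illustration of what tolerant reading could look like, here is a small sketch that strips trailing carriage returns while iterating over a file. This is not the code in corenlp.py; the helper name `iter_clean_lines` and the UTF-8 default are assumptions made for the example.

```python
def iter_clean_lines(path, encoding="utf-8"):
    """Yield each line of `path` without its trailing CR/LF characters."""
    # newline="\n" turns off newline translation, so lines are split on LF
    # only and any CR stays at the end of the line where rstrip can drop it.
    with open(path, encoding=encoding, newline="\n") as fh:
        for line in fh:
            yield line.rstrip("\r\n")
```

A line that held only ^M comes out as an empty string, so a caller that treats empty lines as separators would still need to decide how to handle that case.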