amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

support conllu segmentation/empty nodes

ftyers opened this issue

For compatibility with the Universal Dependencies treebanks, it would be great if xrenner could support the CoNLL-U data format [1] natively, through a command-line option. Two things don't seem to work at the moment:

  • Skipping segmentation spans (multiword token ranges) like 2-4
2-4    dámelo
2    da
3    me
4    lo
  • Skipping empty nodes like 5.1
1      Sue       Sue
2      likes     like
3      coffee    coffee
4      and       and
5      Bill      Bill
5.1    likes     like
6      tea       tea
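
A minimal sketch of how both kinds of line could be dropped before the dependency columns are read. This is not xrenner's code: the names `is_plain_token_id` and `skip_special_lines` are hypothetical, and the sketch assumes tab-separated CoNLL-U input with the token ID in the first column.

```python
import re

# Regular token IDs are plain integers. Range lines ("2-4") contain a
# hyphen and empty nodes ("5.1") contain a dot, so neither matches.
PLAIN_ID = re.compile(r"[0-9]+")


def is_plain_token_id(token_id):
    """Return True for IDs like "3", False for "2-4" or "5.1"."""
    return PLAIN_ID.fullmatch(token_id) is not None


def skip_special_lines(lines):
    """Yield CoNLL-U lines, leaving out range and empty-node lines.

    Blank lines (sentence breaks) and "#" comment lines pass through
    unchanged; how they are handled afterwards is up to the caller.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#"):
            yield stripped
            continue
        token_id = stripped.split("\t", 1)[0]  # first column is the ID
        if is_plain_token_id(token_id):
            yield stripped
```

Applied to the two examples above, this would keep tokens 2, 3 and 4 of "dámelo" and tokens 1 to 6 of the "Sue likes coffee" sentence. Dropping the lines should leave the basic HEAD column consistent, since as far as I can tell from the format description it only ever refers to regular token IDs.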

At the moment, for the segmentation spans we get:

$ python3 xrenner.py -m eng -o html /tmp/sherlock.conllu >/tmp/sherlock.html
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "xrenner.py", line 70, in xrenner_worker
    output = xrenner.analyze(file_, options.format)
  File "/home/fran/source/xrenner/xrenner/modules/xrenner_xrenner.py", line 135, in analyze
    head_id = "0" if cols[6] == "0" else str(int(cols[6]) + self.tokoffset)
ValueError: invalid literal for int() with base 10: '_'
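
The error seems to come from the range line itself: in CoNLL-U, a line like 2-4 has "_" in the HEAD column (index 6), so int() fails on it. A small sketch that reproduces this outside xrenner; `head_with_offset` only imitates the expression from the traceback, `try_head` is a hypothetical helper, and the annotation values in the sample token lines are made up for illustration.

```python
# A multiword-token range line: every column after FORM is "_"
# (MISC may carry a value in real data; it is left empty here).
range_line = "2-4\tdámelo\t_\t_\t_\t_\t_\t_\t_\t_"
# A regular token line with illustrative annotation values.
token_line = "2\tda\tdar\tVERB\t_\t_\t0\troot\t_\t_"


def head_with_offset(cols, tokoffset):
    """Standalone imitation of the failing expression, not xrenner's API."""
    return "0" if cols[6] == "0" else str(int(cols[6]) + tokoffset)


def try_head(line, tokoffset=0):
    """Return the shifted head ID, or None if HEAD is not an integer."""
    cols = line.split("\t")
    try:
        return head_with_offset(cols, tokoffset)
    except ValueError:
        # Raised for "_" in the HEAD column, as on range lines.
        return None
```

Here `try_head(range_line)` returns None where the original expression raises the ValueError, while `try_head(token_line)` returns "0". Catching the exception is only meant to show where the failure comes from; filtering such lines by their ID before this point is probably the cleaner fix.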

[1] http://universaldependencies.org/format.html