Support CoNLL-U segmentation/empty nodes
ftyers opened this issue
For compatibility with the Universal Dependencies treebanks, it would be great if xrenner could support the CoNLL-U data format[1] natively, through a command-line option. Two things don't seem to work:
- Skipping segmentation spans like 2-4:
2-4 dámelo
2 da
3 me
4 lo
- And empty node spans like 5.1:
1 Sue Sue
2 likes like
3 coffee coffee
4 and and
5 Bill Bill
5.1 likes like
6 tea tea
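To make the two cases concrete, here is a minimal sketch of a pre-filter that drops range lines and empty-node lines before the rest of the pipeline sees them. It is not xrenner code: the function name `drop_ranges_and_empty_nodes` and the helper `row` are hypothetical, and the sample rows fill every column except ID and FORM with `_` purely for illustration.

```python
import re

# Plain token IDs are integers; "2-4" is a range line and "5.1" is an
# empty node, so neither matches this pattern.
PLAIN_ID = re.compile(r"[0-9]+")


def drop_ranges_and_empty_nodes(conllu_text):
    """Return the text with range lines (2-4) and empty-node lines (5.1) removed."""
    kept = []
    for line in conllu_text.split("\n"):
        # Comments and blank sentence separators pass through unchanged.
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        tok_id = line.split("\t", 1)[0]
        if PLAIN_ID.fullmatch(tok_id):
            kept.append(line)
    return "\n".join(kept)


def row(tok_id, form):
    """Hypothetical helper: build a 10-column row with only ID and FORM set."""
    return "\t".join([tok_id, form] + ["_"] * 8)


sample = "\n".join([
    row("2-4", "dámelo"), row("2", "da"), row("3", "me"), row("4", "lo"),
    "",
    row("5", "Bill"), row("5.1", "likes"), row("6", "tea"),
])

filtered = drop_ranges_and_empty_nodes(sample)
print(filtered)
```

Dropping the lines outright is the simplest option; a fuller implementation might keep the range information around so the surface form can be restored in the output.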
At the moment, for the segmentation spans we get:
$ python3 xrenner.py -m eng -o html /tmp/sherlock.conllu >/tmp/sherlock.html
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "xrenner.py", line 70, in xrenner_worker
    output = xrenner.analyze(file_, options.format)
  File "/home/fran/source/xrenner/xrenner/modules/xrenner_xrenner.py", line 135, in analyze
    head_id = "0" if cols[6] == "0" else str(int(cols[6]) + self.tokoffset)
ValueError: invalid literal for int() with base 10: '_'
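The error appears to come from the range line itself: in CoNLL-U, a range line leaves every field except ID, FORM and MISC as `_`, so the HEAD column (`cols[6]`) is `_` and `int()` rejects it. The snippet below reproduces that in isolation. The `shifted_head` function is a hypothetical guard modelled on the line shown in the traceback, not the actual xrenner fix.

```python
# A range line with all fields except ID and FORM left as "_".
range_line = "2-4\tdámelo\t_\t_\t_\t_\t_\t_\t_\t_"
cols = range_line.split("\t")

# Reproduce the failure from the traceback: HEAD is "_", not an integer.
try:
    int(cols[6])
    error = None
except ValueError as exc:
    error = str(exc)

print(error)


def shifted_head(cols, tokoffset):
    """Hypothetical guard: return None when HEAD is not an integer."""
    head = cols[6]
    if head == "0":
        return "0"
    try:
        return str(int(head) + tokoffset)
    except ValueError:
        # Range lines and empty nodes carry "_" in the HEAD column.
        return None
```

A caller could then skip any row for which `shifted_head` returns `None`, instead of crashing the worker process.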