amir-zeldes / xrenner

eXternally configurable REference and Non Named Entity Recognizer

False positive in cycle detection

coryandrewtaylor opened this issue · comments

I am trying to run a tokenized copy of the Gospel of John through xrenner, but it is misidentifying cycles in the syntax tree. I've tried various iterations of the input file. If the file contains more than one sentence, a cycle is detected at the third token in the file. (I've verified, with nltk and by hand, that the sentences in question do not actually have a cycle in their syntax trees.)

For example, for one file (gospel_john_test_1.txt), I get the following error:

((py27)) C:\Users\ctaylor\Documents\xrenner>python xrenner.py ./input/gospel_john_test_1.txt > ./output/gospel_john_test_1.sgml

Cycle detected in syntax tree in sentence 1 (child of token: '[')
Exiting due to invalid input

For a second file, (gospel_john_test_2.txt), I get:

((py27)) C:\Users\ctaylor\Documents\xrenner>python xrenner.py ./input/gospel_john_test_2.txt > ./output/gospel_john_test_2.sgml

Cycle detected in syntax tree in sentence 1 (child of token: 'beginning')
Exiting due to invalid input

But a third file (gospel_john_test_3.txt), which contains the first sentence from gospel_john_test_2, works fine.

I have been able to run xrenner successfully on the full text of the Gospel of Mark (gospel_mark_test.txt). I scraped Mark and John from the same edition and tokenized both of them with spaCy (code here), so I don't think it's a problem with the file's structure.

I'm on Windows 7, Anaconda 4.1.0, Python 2.7.11.

As far as I can tell, there are two problems in john_test_1:

  • Token 2 in sentence 1 has empty columns, including the token and lemma columns and, perhaps more crucially, the dependency function column. This is not what is causing the error (though it may cause one later on). If you genuinely want empty columns, put an underscore in them - except for column 7, which must be numeric, so put a 0 there.
  • There is no blank line between sentence 1 and sentence 2. The blank line is what tells the system that a new sentence has started, so in its absence the system thinks there are two tokens with ID 1, two with ID 2, etc., and as a result phantom 'cycles' are created (actually involving tokens from two sentences). I think it will also run extremely slowly if it believes everything is one huge sentence, since it processes sentence-wise.
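For concreteness, here is a minimal sketch of well-formed input illustrating both points (the tokens are invented; what matters is the underscores in empty columns, the numeric 0 in column 7, and the blank line between the sentences):

```
1	In	in	ADP	IN	_	2	case	_	_
2	_	_	_	_	_	0	_	_	_

1	And	and	CCONJ	CC	_	0	root	_	_
```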

File 2 has the same problem, whereas file 3 has only one sentence. I'm very surprised that the full text worked for you!

Basically what you want is sentences separated by blank lines, where each sentence has one token per line and ten filled columns per token, with numerical, non-cycle-forming values in columns 1 and 7. Does that work?
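If it helps, those constraints can be checked before running xrenner with a short script. This is just a rough sketch of my own, not part of xrenner - it checks for ten tab-separated columns per token, numeric values in columns 1 (ID) and 7 (HEAD), and token IDs that restart only after a blank line:

```python
# Minimal conll10 sanity checker (hypothetical helper, not part of xrenner).

def check_conll10(lines):
    """Return (line_number, problem) pairs for suspicious lines."""
    problems = []
    prev_id = 0
    for num, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():  # blank line = sentence break
            prev_id = 0
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append((num, "expected 10 columns, found %d" % len(cols)))
            continue
        if not cols[0].isdigit():
            problems.append((num, "column 1 (ID) is not numeric"))
            continue
        if not cols[6].isdigit():
            problems.append((num, "column 7 (HEAD) is not numeric"))
        tok_id = int(cols[0])
        if tok_id <= prev_id:
            problems.append((num, "token ID does not increase - "
                                  "missing blank line between sentences?"))
        prev_id = tok_id
    return problems
```

Running it over a file's lines (`check_conll10(open("input.conll10").readlines())`) would flag both of the problems above before xrenner ever sees the file.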

Yes, it is working now. Thanks!

(Thanks also for clearing up the speed issue. The Mark file took over 24 hours to run the first time, and ended up using close to 3 GB of memory. I just ran it and the John file with line breaks added between sentences, and each only took a few minutes.)

Great, I should probably add the blank line between sentences to the documentation. BTW, if you're running multiple documents through, calling the system in batch mode like this (using glob syntax) will be faster, and if you add the -v flag (verbose) it will keep you informed about progress too:

> python xrenner.py -v *.conll10

This way it only loads the models once, so you save time. Generally speaking, multiple shorter documents are faster and more accurate than a single very long one, since the longer context introduces possible false antecedents across text boundaries.
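Following that advice, one way to break a long conll10 file into smaller documents is a few lines of Python. This is a rough sketch under my own assumptions (the chunk size and the `chunk_NNN.conll10` naming scheme are invented, not anything xrenner requires):

```python
# Rough sketch: split a long conll10 document into files of at most
# `n` sentences each, so xrenner processes several short documents
# instead of one very long one.

def split_sentences(text):
    """Sentences in conll10 text are separated by blank lines."""
    return [block for block in text.strip().split("\n\n") if block.strip()]

def chunk_sentences(sentences, n):
    """Group sentences into chunks of at most n sentences each."""
    return ["\n\n".join(sentences[i:i + n])
            for i in range(0, len(sentences), n)]

def write_chunks(text, n, prefix="chunk"):
    """Write each chunk to prefix_000.conll10, prefix_001.conll10, ..."""
    for i, doc in enumerate(chunk_sentences(split_sentences(text), n)):
        with open("%s_%03d.conll10" % (prefix, i), "w") as out:
            out.write(doc + "\n")
```

After writing the chunks, the batch invocation above (`python xrenner.py -v chunk_*.conll10`) would pick them all up in one run.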