lucene 7 add token offset problems

Question

eroux opened this issue 6 years ago

When analyzing "dharmottara"@sa-alalc97, the new Fuseki fails with:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=11,lastStartOffset=7 for field 'rdfsLabel_sa-alalc97'
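
The message packs three conditions into one check: the start offset must not be negative, the end offset must not precede the start offset, and a token's start offset must not be smaller than the previous token's start offset. A minimal plain-Java sketch of that invariant follows. It is not Lucene's own code; the class name, the method name, and the first three offset pairs are hypothetical, and only the final pair (5, 11) and the previous start offset 7 come from the error above.

```java
// Sketch only: mirrors the three conditions named in the error message.
class OffsetInvariantSketch {

    /**
     * Returns the index of the first token whose offsets break the invariant,
     * or -1 if the whole stream is valid. Each row is {startOffset, endOffset}.
     */
    static int firstViolation(int[][] offsets) {
        int lastStartOffset = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            if (start < 0 || end < start || start < lastStartOffset) {
                return i;
            }
            lastStartOffset = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical token stream for "dharmottara" (11 characters).
        int[][] offsets = { {0, 3}, {3, 7}, {7, 9}, {5, 11} };
        System.out.println(firstViolation(offsets));
    }
}
```

With these assumed offsets the last token is flagged, because its start offset 5 is smaller than the previous start offset 7.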

There might be an unguarded call, and we could update to the Lucene 7 API.

Chris Tomlinson · Answer 1 · Fri Nov 09 2018 00:44:46 GMT+0800 (China Standard Time)

There was an unguarded call to setOffset in SkrtSyllableTokenizer.end. I'll make a release for version 1.0.6 that includes Élie's changes as well and test.
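
The fix itself is not shown in this thread. As a rough illustration of what guarding such a call can look like, the plain-Java sketch below uses a stand-in `OffsetHolder` class (hypothetical, not Lucene's `OffsetAttribute`) that rejects invalid arguments, and a hypothetical `safeSetOffset` helper that clamps the values before passing them on.

```java
// Sketch only: OffsetHolder is a hypothetical stand-in, not a Lucene class.
class GuardedOffsetSketch {

    static class OffsetHolder {
        int start;
        int end;

        // Rejects offsets that are negative or inverted.
        void setOffset(int startOffset, int endOffset) {
            if (startOffset < 0 || endOffset < startOffset) {
                throw new IllegalArgumentException(
                    "bad offsets: startOffset=" + startOffset + ",endOffset=" + endOffset);
            }
            start = startOffset;
            end = endOffset;
        }
    }

    // Hypothetical guard: clamp both values into [0, inputLength] and keep end >= start.
    static void safeSetOffset(OffsetHolder holder, int start, int end, int inputLength) {
        int s = Math.max(0, Math.min(start, inputLength));
        int e = Math.max(s, Math.min(end, inputLength));
        holder.setOffset(s, e);
    }
}
```

Clamping of this kind only avoids the argument check on a single call. It does not stop offsets from going backwards between tokens, which is the condition reported in the question.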

Chris Tomlinson · Answer 2 · Wed Nov 14 2018 02:06:10 GMT+0800 (China Standard Time)

We are now on lucene-sa v1.0.10, and a try/catch has been added to TextIndexLucene.addDocument to capture failures that occur outside of lucene-sa during the indexing process. This lets us collect more specific data that should help isolate the problem in lucene-sa:

SANSKRIT_FAILURES-lucene-sa-1.0.10.txt
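
The try/catch itself is not shown here. A sketch of the general pattern, in plain Java, with a hypothetical `Indexer` interface standing in for the real indexing call and assuming the failure surfaces as an `IllegalArgumentException`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: Indexer and collectFailures are hypothetical names.
class FailureCollectorSketch {

    interface Indexer {
        void addDocument(String field, String text);
    }

    // Indexes every text and records the rejected ones instead of aborting the run.
    static List<String> collectFailures(Indexer indexer, String field, List<String> texts) {
        List<String> failures = new ArrayList<>();
        for (String text : texts) {
            try {
                indexer.addDocument(field, text);
            } catch (IllegalArgumentException ex) {
                // Keep the field, the input, and the message so the case can be replayed.
                failures.add(field + "\t" + text + "\t" + ex.getMessage());
            }
        }
        return failures;
    }
}
```

Recording the input string next to the exception message is what makes such a failure list useful for reproducing each case in the analyzer alone.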