buda-base / lucene-sa

Lucene analyzer for Sanskrit

lucene 7 add token offset problems

eroux opened this issue

When analyzing "dharmottara"@sa-alalc97, the new Fuseki fails with:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=5,endOffset=11,lastStartOffset=7 for field 'rdfsLabel_sa-alalc97'

There might be an unguarded call, and we could update to the Lucene 7 API.
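For reference, the invariant named in the error message can be sketched in plain Java without any Lucene dependency. This is only an illustration of the three conditions in the message, not Lucene's actual code; the class name `OffsetCheck`, the method `firstViolation`, and the first two token offsets are hypothetical. Only the last pair (5, 11) and the preceding start offset of 7 come from the reported error.

```java
public class OffsetCheck {
    /**
     * Returns the index of the first token whose offsets break the invariant
     * from the error message, or -1 if every token is valid.
     * Each entry is {startOffset, endOffset}.
     */
    static int firstViolation(int[][] offsets) {
        int lastStartOffset = 0;
        for (int i = 0; i < offsets.length; i++) {
            int start = offsets[i][0];
            int end = offsets[i][1];
            // start must be non-negative, end must be >= start,
            // and start must not go backwards relative to the previous token
            if (start < 0 || end < start || start < lastStartOffset) {
                return i;
            }
            lastStartOffset = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical token stream; the last token reproduces
        // startOffset=5, endOffset=11, lastStartOffset=7 from the report.
        int[][] tokens = { {0, 7}, {7, 11}, {5, 11} };
        System.out.println(firstViolation(tokens)); // prints 2
    }
}
```

Running a tokenizer's output through a check like this in a unit test would point at the offending token before the text ever reaches the indexer.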

There was an unguarded call to setOffset in SkrtSyllableTokenizer.end. I'll make a release for version 1.0.6 that includes Élie's changes as well and test.

We are now on lucene-sa v1.0.10, and a try/catch has been added to TextIndexLucene.addDocument to capture failures that occur outside of lucene-sa during indexing. This lets us collect more specific data, which should help isolate the problem in lucene-sa:

SANSKRIT_FAILURES-lucene-sa-1.0.10.txt
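The capture approach described above could look roughly like the following. This is a minimal sketch in plain Java, not the actual change made to TextIndexLucene.addDocument; the `Indexer` interface, the `FailureCapture` class, and the tab-separated record format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class FailureCapture {
    /** Hypothetical stand-in for the real indexing call. */
    interface Indexer {
        void index(String field, String value);
    }

    /**
     * Tries to index one value. On failure, records the field, the value and
     * the exception message instead of aborting the whole indexing run.
     * Returns true if indexing succeeded.
     */
    static boolean addDocument(Indexer indexer, String field, String value,
                               List<String> failures) {
        try {
            indexer.index(field, value);
            return true;
        } catch (RuntimeException e) {
            // Keep enough context to reproduce the problem later
            failures.add(field + "\t" + value + "\t" + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> failures = new ArrayList<>();
        // Simulated indexer that always rejects the input
        Indexer failing = (f, v) -> {
            throw new IllegalArgumentException("offsets must not go backwards");
        };
        addDocument(failing, "rdfsLabel_sa-alalc97", "dharmottara", failures);
        failures.forEach(System.out::println);
    }
}
```

Recording the field and the original string alongside the message means each failing input can be replayed against the analyzer in isolation.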