Can not import the retrieval from anserini

Jia-py opened this issue a year ago

Hi! Thanks for your great work!
I want to run experiments in PyTerrier using BM25 from Anserini, but I hit the following error:
jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

Here is my code:

>>> indexref = pt.IndexRef.of('./index/data.properties')
>>> index = pt.IndexFactory.of(indexref)
>>> bm25 = pt.anserini.AnseriniBatchRetrieve(index,wmodel='BM25')

I have installed JDK 11 and found this code in the documentation:

trIndex = "/path/to/data.properties"
luceneIndex = "/path/to/lucene-index-dir"
BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")

Could it be because I haven't imported the Lucene index? I'm currently using another open-source search library, beir, and have Elasticsearch running. Is there a convenient way to obtain the Lucene index that can be read in this context?

Thanks.

Craig Macdonald · Answer 1 · Fri Sep 15 2023 22:07:29 GMT+0800 (China Standard Time)

Did you start pt.init() with Anserini in boot_packages, like this?
https://github.com/terrier-org/pyterrier/blob/master/tests/anserini/test_anserini.py#L23

I think we have only tested with 0.9.2 which is probably old now. Which version of Anserini are you using?
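
A minimal sketch of that start-up, assuming a PyTerrier release in which `pt.init()` accepts `boot_packages` (as used elsewhere in this thread). The helper names `anserini_fatjar` and `start_with_anserini` are made up for illustration, and the default version is simply the one mentioned above.

```python
def anserini_fatjar(version: str) -> str:
    """Build the Maven coordinate of the Anserini fat jar for a given version."""
    return f"io.anserini:anserini:{version}:fatjar"


def start_with_anserini(version: str = "0.9.2") -> None:
    """Start PyTerrier's JVM with the Anserini fat jar on the classpath.

    This has to run before anything else starts the JVM, because the
    classpath cannot be extended once the JVM is up.
    """
    import pyterrier as pt  # lazy import: the helper above works without it

    if not pt.started():
        pt.init(boot_packages=[anserini_fatjar(version)])
```

Calling `start_with_anserini()` once at the top of a script or notebook, before any other PyTerrier or pyserini import touches the JVM, should be enough.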

Craig Macdonald · Answer 2 · Fri Sep 15 2023 22:12:09 GMT+0800 (China Standard Time)

NoClassDefFoundError usually occurs either because the JVM process was forked or because a dependency jar file is missing from the classpath. Because we ask for the fatjar, the dependencies should be included.

Can you show the Python stack trace? A Colab with a minimal working example would be helpful.
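
One way to rule out the missing-jar case is to look inside the jar directly, since a jar is just a zip archive. This is a rough sketch; `jar_has_class` is a hypothetical helper, and the jar path in the usage note is a placeholder for wherever the fatjar was downloaded.

```python
import zipfile


def jar_has_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar (a zip archive) contains the given Java class."""
    # io.anserini.eval.Qrels is stored as io/anserini/eval/Qrels.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, `jar_has_class("/path/to/anserini-fatjar.jar", "io.anserini.eval.Qrels")` tells you whether the class named in the exception is present in that jar at all. If it is present, the problem is more likely the classpath the JVM was started with.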

Jia Pengyue · Answer 3 · Fri Sep 15 2023 22:14:12 GMT+0800 (China Standard Time)

Thanks for your reply.
Yes, I started pt.init() with anserini 0.9.2 and here is the python stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 17, in <module>
    from ._base import JQuery, JQueryGenerator, JDisjunctionMaxQueryGenerator, get_topics,\
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_base.py", line 35, in <module>
    JQrels = autoclass('io.anserini.eval.Qrels')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

Craig Macdonald · Answer 4 · Fri Sep 15 2023 22:17:59 GMT+0800 (China Standard Time)

Ah, I forgot pyserini is involved. I think the pyserini version has to match the anserini version. Our unit tests use pyserini==0.9.4.

If your anserini index is newer than that, then you can try upgrading. I'm happy to have a PR for more recent pyserini support, but it's not something we use ourselves.

Jia Pengyue · Answer 5 · Fri Sep 15 2023 22:23:55 GMT+0800 (China Standard Time)

Thank you, I'll have a try.

Craig Macdonald · Answer 6 · Fri Sep 15 2023 22:26:26 GMT+0800 (China Standard Time)

(I'm also thinking that Anserini support could move from Pyterrier itself into a smaller separate repo, just like we do for pyterrier_colbert etc). That would enable better unit testing etc.

Craig Macdonald · Answer 7 · Sat Sep 16 2023 00:22:41 GMT+0800 (China Standard Time)

Let me know how you get on.

Jia Pengyue · Answer 8 · Sat Sep 16 2023 01:17:43 GMT+0800 (China Standard Time)

I downgraded pyterrier to 0.9.4 and downloaded the Lucene index from pyserini, but then I met this error.

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 48, in __init__
    self.object = JSimpleSearcher(JString(index_dir))
  File "jnius/jnius_export_class.pxi", line 270, in jnius.JavaClass.__init__
  File "jnius/jnius_export_class.pxi", line 384, in jnius.JavaClass.call_constructor
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

Then I upgraded Anserini with pt.init(boot_packages=["io.anserini:anserini:0.22.0:fatjar"]) and got a different error:

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 49, in __init__
    self.num_docs = self.object.getTotalNumDocuments()
AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

Craig Macdonald · Answer 9 · Sat Sep 16 2023 01:22:50 GMT+0800 (China Standard Time)

> I downgraded pyterrier to 0.9.4,

You mean pyserini to 0.9.4?

> jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

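This error occurs because your anserini and pyserini are too old for your index: its segments file reports format version 10, while this build only reads versions 7 to 9. Newer versions are needed.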
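
To see which format version an index reports without starting a JVM, something like the following might help. It is only a sketch and rests on an assumption about how Lucene lays out the start of a `segments_N` file: a big-endian magic number, a length-prefixed codec name, then a big-endian version. The function names are hypothetical.

```python
import struct
from pathlib import Path
from typing import Tuple

CODEC_MAGIC = 0x3FD76C17  # assumed magic number at the start of Lucene codec files


def read_codec_header(data: bytes) -> Tuple[str, int]:
    """Parse (codec_name, format_version) from the first bytes of a Lucene file."""
    (magic,) = struct.unpack_from(">I", data, 0)
    if magic != CODEC_MAGIC:
        raise ValueError("not a Lucene codec header")
    # The codec name is prefixed by its length as a variable-length int (VInt).
    pos, length, shift = 4, 0, 0
    while True:
        byte = data[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        if byte < 0x80:
            break
        shift += 7
    name = data[pos:pos + length].decode("utf-8")
    (version,) = struct.unpack_from(">i", data, pos + length)
    return name, version


def segments_format_version(index_dir: str) -> int:
    """Read the format version from the latest segments_N file in index_dir."""
    candidates = list(Path(index_dir).glob("segments_*"))
    if not candidates:
        raise FileNotFoundError(f"no segments_N file in {index_dir}")
    # N is the commit generation, assumed to be written in base 36.
    latest = max(candidates, key=lambda p: int(p.name.split("_", 1)[1], 36))
    with open(latest, "rb") as f:
        return read_codec_header(f.read(64))[1]
```

If `segments_format_version(luceneIndex)` returns a number outside the range named in the exception, the index was written by a newer Lucene than the one inside the loaded Anserini jar.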

> AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

This error is because your pyserini version does not match your anserini fat jar. You have to keep them in sync somehow.
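
A quick sanity check before constructing the retriever could look like the sketch below. The thread only establishes that the unit tests pair anserini 0.9.2 with pyserini 0.9.4, so comparing just the major.minor part is a guess on my side, not a documented compatibility rule. Both function names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a Python distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def same_release_line(pyserini_version: str, anserini_version: str) -> bool:
    """Heuristic (an assumption): versions belong together if major.minor agree."""
    return pyserini_version.split(".")[:2] == anserini_version.split(".")[:2]
```

For instance, compare `installed_version("pyserini")` against the version string you pass in `boot_packages` and fail early with a readable message if `same_release_line` says they differ.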

Jia Pengyue · Answer 10 · Sat Sep 16 2023 01:56:51 GMT+0800 (China Standard Time)

Thanks for your reply! I followed your advice and used anserini and pyserini both at 0.22.0. There is still something wrong with it.

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 19, in <module>
    from .lucene import JLuceneSearcherResult, LuceneSimilarities, LuceneFusionSearcher, LuceneSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/__init__.py", line 18, in <module>
    from ._impact_searcher import JImpactSearcherResult, LuceneImpactSearcher, SlimSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/_impact_searcher.py", line 34, in <module>
    from pyserini.index import Document
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/__init__.py", line 21, in <module>
    from .lucene._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/__init__.py", line 17, in <module>
    from ._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/_base.py", line 30, in <module>
    from pyserini.analysis import get_lucene_analyzer, JAnalyzer, JAnalyzerUtils
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/__init__.py", line 17, in <module>
    from ._base import get_lucene_analyzer, Analyzer, JAnalyzer, JAnalyzerUtils, JDefaultEnglishAnalyzer, JWhiteSpaceAnalyzer
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/_base.py", line 26, in <module>
    JDanishAnalyzer = autoclass('org.apache.lucene.analysis.da.DanishAnalyzer')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Bad type on operand stack
Exception Details:
  Location:
    org/apache/lucene/analysis/da/DanishAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents; @65: invokespecial
  Reason:
    Type 'org/tartarus/snowball/ext/DanishStemmer' (current frame, stack[3]) is not assignable to 'org/tartarus/snowball/SnowballStemmer'
  Current Frame:
    bci: @65
    flags: { }
    locals: { 'org/apache/lucene/analysis/da/DanishAnalyzer', 'java/lang/String', 'org/apache/lucene/analysis/Tokenizer', 'org/apache/lucene/analysis/TokenStream' }
    stack: { uninitialized 53, uninitialized 53, 'org/apache/lucene/analysis/TokenStream', 'org/tartarus/snowball/ext/DanishStemmer' }
  Bytecode:
    0000000: bb00 0959 b700 0a4d bb00 0b59 2cb7 000c
    0000010: 4ebb 000d 592d 2ab4 000e b700 0f4e 2ab4
    0000020: 0008 b600 109a 0010 bb00 1159 2d2a b400
    0000030: 08b7 0012 4ebb 0013 592d bb00 1459 b700
    0000040: 15b7 0016 4ebb 0017 592c 2db7 0018 b0  
  Stackmap Table:
    append_frame(@53,Object[#57],Object[#58])
 java.lang.VerifyError

Craig Macdonald · Answer 11 · Sat Sep 16 2023 02:40:35 GMT+0800 (China Standard Time)

Gosh, I have never seen this error before.

I think, maybe, that Terrier ships with one version of Snowball's Danish stemmer, and Lucene ships with another. If this is the case, a bit of hacking will be needed to address this.
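
That hypothesis could be tested by listing which Snowball classes appear in both jars. A rough sketch follows; the helper names are hypothetical and the two jar paths are whatever Terrier and Anserini jars ended up on your classpath. An overlap would be consistent with the conflict but would not prove it is the cause.

```python
import zipfile
from typing import Set


def classes_in_jar(jar_path: str, package_prefix: str) -> Set[str]:
    """Collect the .class entries of a jar that live under the given package."""
    prefix = package_prefix.replace(".", "/") + "/"
    with zipfile.ZipFile(jar_path) as jar:
        return {
            name for name in jar.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        }


def shared_classes(jar_a: str, jar_b: str,
                   package_prefix: str = "org.tartarus.snowball") -> Set[str]:
    """Return class entries under package_prefix that are present in both jars."""
    return classes_in_jar(jar_a, package_prefix) & classes_in_jar(jar_b, package_prefix)
```

If `org/tartarus/snowball/ext/DanishStemmer.class` shows up in the result, both jars ship their own copy and the JVM may be mixing them.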

Craig Macdonald · Answer 12 · Sat Sep 16 2023 02:41:28 GMT+0800 (China Standard Time)

Have you considered just using the results file output from anserini, and using pt.Transformer.from_df(pt.io.read_results(file)) instead?
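
As a sketch of that route: if you do not have a run file yet, the standard TREC run format is easy to produce from any list of hits, and the resulting file can then be wrapped as a PyTerrier transformer. `trec_run_lines` and `load_run` are hypothetical helper names, the run name is arbitrary, and the hits are assumed to be (docno, score) pairs per query.

```python
from typing import Dict, Iterator, List, Tuple


def trec_run_lines(results: Dict[str, List[Tuple[str, float]]],
                   run_name: str = "anserini-bm25") -> Iterator[str]:
    """Yield TREC run lines: 'qid Q0 docno rank score run_name'."""
    for qid, hits in results.items():
        # Highest score first; ranks start at 1 within each query.
        ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            yield f"{qid} Q0 {docno} {rank} {score} {run_name}"


def load_run(path: str):
    """Wrap a TREC run file as a PyTerrier transformer (needs pyterrier)."""
    import pyterrier as pt  # lazy import so trec_run_lines works without it

    if not pt.started():
        pt.init()
    return pt.Transformer.from_df(pt.io.read_results(path))
```

Writing `"\n".join(trec_run_lines(results))` to a file and passing that path to `load_run` gives a transformer that simply replays the stored ranking for each qid, so it can be used in pt.Experiment alongside Terrier retrievers.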

Jia Pengyue · Answer 13 · Sun Sep 17 2023 01:18:01 GMT+0800 (China Standard Time)

Thank you for your advice! But I didn't generate any result files with Pyserini.

I decided to just use PyTerrier to finish my work for now. By the way, after retrieval I get a dataframe that contains fields such as qid, docid, docno, rank, score, and query. How can I access the document text directly using docid or docno? For some datasets (e.g., beir/trec-covid), I only found the dataset.get_corpus_iter() function, which iterates over the whole corpus but does not let me fetch the text of a specific document directly.

Jia Pengyue · Answer 14 · Sun Sep 17 2023 16:27:14 GMT+0800 (China Standard Time)

Problem resolved.
The doc text can be accessed by index.getMetaIndex()
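
For anyone landing here later, a minimal sketch of that lookup. It assumes the index was built with the document text stored as a metadata field; the key name "text" is an assumption and depends on what was passed at indexing time. The two helper names are hypothetical, while `getItem` and `getDocument` are methods of Terrier's MetaIndex.

```python
def doc_text(index, docid, field="text"):
    """Fetch a stored metadata field for one document by its integer docid."""
    # int() guards against numpy integers taken from a results dataframe
    return index.getMetaIndex().getItem(field, int(docid))


def doc_text_by_docno(index, docno, field="text"):
    """Fetch a stored metadata field for one document by its docno."""
    meta = index.getMetaIndex()
    # Reverse lookup from docno to docid; a negative value means "not found"
    docid = meta.getDocument("docno", str(docno))
    if docid < 0:
        return None
    return meta.getItem(field, docid)
```

With a results dataframe `res`, `doc_text(index, res.iloc[0].docid)` would return the text of the top-ranked document, provided the field was stored in the index.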