terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home Page: https://pyterrier.readthedocs.io/

Can not import the retrieval from anserini

Jia-py opened this issue · comments

Hi! Thanks for your great work!
I want to run experiments in PyTerrier using BM25 from Anserini, but I got the following error:
jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

Here is my code:

>>> indexref = pt.IndexRef.of('./index/data.properties')
>>> index = pt.IndexFactory.of(indexref)
>>> bm25 = pt.anserini.AnseriniBatchRetrieve(index,wmodel='BM25')

I have installed JDK 11 and found this code in the documentation:

trIndex = "/path/to/data.properties"
luceneIndex = "/path/to/lucene-index-dir"
BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")

Could it be because I haven't imported the Lucene index? I'm currently using another open-source search library, beir, and have Elasticsearch running. Is there a convenient way to obtain the Lucene index that can be read in this context?

Thanks.

Did you call pt.init() with Anserini on the boot classpath, as in this test?
https://github.com/terrier-org/pyterrier/blob/master/tests/anserini/test_anserini.py#L23

I think we have only tested with 0.9.2 which is probably old now. Which version of Anserini are you using?

NoClassDefFoundError usually occurs because either the JVM process was forked or a dependency jar file is missing from the classpath. Because we ask for the fatjar, the dependencies should be included.
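
For reference, a minimal sketch of starting PyTerrier with the Anserini fatjar through the `boot_packages` argument of `pt.init()`. The helper names `anserini_boot_package` and `init_with_anserini` are hypothetical, and the default version is only the one mentioned in this thread; it should be whatever matches the installed pyserini.

```python
def anserini_boot_package(version: str) -> str:
    """Build the Maven coordinate (group:artifact:version:classifier) of the Anserini fatjar."""
    return f"io.anserini:anserini:{version}:fatjar"


def init_with_anserini(version: str = "0.9.2") -> None:
    """Start the JVM with the Anserini fatjar added as a boot package.

    This has to run before anything else starts PyTerrier, because the
    classpath cannot be changed once the JVM is up.
    """
    import pyterrier as pt  # imported lazily so the helper above works without PyTerrier

    if not pt.started():
        pt.init(boot_packages=[anserini_boot_package(version)])
```

Calling `init_with_anserini("0.9.2")` once at the top of a script or notebook, before creating any retriever, should be enough.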

Can you show the Python stack trace? A Colab with a minimal working example would be helpful.

Thanks for your reply.
Yes, I started pt.init() with anserini 0.9.2 and here is the python stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 17, in <module>
    from ._base import JQuery, JQueryGenerator, JDisjunctionMaxQueryGenerator, get_topics,\
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_base.py", line 35, in <module>
    JQrels = autoclass('io.anserini.eval.Qrels')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: io/anserini/eval/Qrels java.lang.NoClassDefFoundError

Ah, I forgot pyserini is involved. I think the pyserini version has to match the anserini version. Our unit tests use pyserini==0.9.4.

If your anserini index is newer than that, then you can try upgrading. I'm happy to have a PR for more recent pyserini support, but it's not something we use ourselves.

Thank you, I'll have a try.

(I'm also thinking that Anserini support could move from PyTerrier itself into a smaller separate repo, just as we do for pyterrier_colbert etc. That would enable better unit testing.)

Let me know how you get on.

I downgraded pyterrier to 0.9.4 and downloaded the Lucene index from pyserini, and then I got this error.

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 48, in __init__
    self.object = JSimpleSearcher(JString(index_dir))
  File "jnius/jnius_export_class.pxi", line 270, in jnius.JavaClass.__init__
  File "jnius/jnius_export_class.pxi", line 384, in jnius.JavaClass.call_constructor
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

Then I upgraded anserini with pt.init(boot_packages=["io.anserini:anserini:0.22.0:fatjar"]) and got this instead:

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 69, in __init__
    self.searcher = SimpleSearcher(index_location)
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/_searcher.py", line 49, in __init__
    self.num_docs = self.object.getTotalNumDocuments()
AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

I downgraded pyterrier to 0.9.4

You mean pyserini to 0.9.4?

jnius.JavaException: JVM exception occurred: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade/segments_1"))): 10 (needs to be between 7 and 9) org.apache.lucene.index.IndexFormatTooNewException

This error is because your anserini and pyserini are too old for your index, so newer versions are needed.

AttributeError: 'io.anserini.search.SimpleSearcher' object has no attribute 'getTotalNumDocuments'

This error is because your pyserini version does not match your anserini fat jar. You have to keep them in sync somehow.
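
One way to catch a mismatch before the JVM starts is to compare the two version strings. The sketch below assumes that matching major.minor numbers is a reasonable proxy for compatibility; that is only a heuristic drawn from the versions mentioned in this thread (pyserini 0.9.4 with anserini 0.9.2), not a rule documented by either project. All helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return the first two numeric components of a version such as '0.22.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_look_in_sync(pyserini_version: str, anserini_version: str) -> bool:
    """Heuristic: treat the pair as compatible when major.minor agree."""
    return major_minor(pyserini_version) == major_minor(anserini_version)


def installed_pyserini_version() -> Optional[str]:
    """Version of the installed pyserini package, or None if it is not installed."""
    try:
        return metadata.version("pyserini")
    except metadata.PackageNotFoundError:
        return None
```

For example, `versions_look_in_sync("0.9.4", "0.22.0")` is false, which corresponds to the combination that produced the AttributeError above.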

Thanks for your reply! I followed your advice and used anserini and pyserini both at 0.22.0. There is still something wrong with it.

>>> luceneIndex = "/root/.cache/pyserini/indexes/lucene-index.beir-v1.0.0-trec-covid.flat.20221116.505594.57b812594b11d064a23123137ae7dade"
>>> BM25_ai = pt.anserini.AnseriniBatchRetrieve(luceneIndex, wmodel="BM25")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyterrier/anserini.py", line 68, in __init__
    from pyserini.search import SimpleSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/__init__.py", line 19, in <module>
    from .lucene import JLuceneSearcherResult, LuceneSimilarities, LuceneFusionSearcher, LuceneSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/__init__.py", line 18, in <module>
    from ._impact_searcher import JImpactSearcherResult, LuceneImpactSearcher, SlimSearcher
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/search/lucene/_impact_searcher.py", line 34, in <module>
    from pyserini.index import Document
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/__init__.py", line 21, in <module>
    from .lucene._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/__init__.py", line 17, in <module>
    from ._base import Document, Generator, IndexTerm, Posting, IndexReader
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/index/lucene/_base.py", line 30, in <module>
    from pyserini.analysis import get_lucene_analyzer, JAnalyzer, JAnalyzerUtils
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/__init__.py", line 17, in <module>
    from ._base import get_lucene_analyzer, Analyzer, JAnalyzer, JAnalyzerUtils, JDefaultEnglishAnalyzer, JWhiteSpaceAnalyzer
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/pyserini/analysis/_base.py", line 26, in <module>
    JDanishAnalyzer = autoclass('org.apache.lucene.analysis.da.DanishAnalyzer')
  File "/opt/conda/envs/terrier/lib/python3.8/site-packages/jnius/reflect.py", line 209, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 22, in jnius.find_javaclass
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: Bad type on operand stack
Exception Details:
  Location:
    org/apache/lucene/analysis/da/DanishAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents; @65: invokespecial
  Reason:
    Type 'org/tartarus/snowball/ext/DanishStemmer' (current frame, stack[3]) is not assignable to 'org/tartarus/snowball/SnowballStemmer'
  Current Frame:
    bci: @65
    flags: { }
    locals: { 'org/apache/lucene/analysis/da/DanishAnalyzer', 'java/lang/String', 'org/apache/lucene/analysis/Tokenizer', 'org/apache/lucene/analysis/TokenStream' }
    stack: { uninitialized 53, uninitialized 53, 'org/apache/lucene/analysis/TokenStream', 'org/tartarus/snowball/ext/DanishStemmer' }
  Bytecode:
    0000000: bb00 0959 b700 0a4d bb00 0b59 2cb7 000c
    0000010: 4ebb 000d 592d 2ab4 000e b700 0f4e 2ab4
    0000020: 0008 b600 109a 0010 bb00 1159 2d2a b400
    0000030: 08b7 0012 4ebb 0013 592d bb00 1459 b700
    0000040: 15b7 0016 4ebb 0017 592c 2db7 0018 b0  
  Stackmap Table:
    append_frame(@53,Object[#57],Object[#58])
 java.lang.VerifyError

Gosh, I have never seen this error before.

I think, maybe, that Terrier ships with one version of Snowball's Danish stemmer and Lucene ships with another. If this is the case, a bit of hacking will be needed to address this.

Have you considered just using the results file output from anserini, and using pt.Transformer.from_df(pt.io.read_results(file)) instead?
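
A minimal sketch of that workaround, assuming the run was written by Anserini in the standard six-column TREC run format. The function name `transformer_from_run` and the example line are made up for illustration.

```python
# One line of a TREC-format run file: qid, the literal "Q0", docno, rank, score, run tag.
EXAMPLE_RUN_LINE = "1 Q0 doc42 0 12.5 bm25"


def transformer_from_run(run_path: str):
    """Wrap a precomputed run file as a PyTerrier transformer.

    No JVM-side Anserini classes are needed: the file is read into a
    DataFrame and the transformer serves rows from that DataFrame.
    """
    import pyterrier as pt  # imported lazily; only needed when the function is called

    results = pt.io.read_results(run_path)  # the default format is "trec"
    return pt.Transformer.from_df(results)
```

The returned transformer can then be used like any other retriever, for example as one of the systems passed to `pt.Experiment`.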

Thank you for your advice! But I didn't generate any result files with Pyserini.

I decided to just use pyterrier to finish my work for now. By the way, after retrieving, I got a dataframe that contains fields such as qid, docid, docno, rank, score, and query. How can I look up a document's text directly by docid or docno? For some datasets (e.g., beir/trec-covid), I only found dataset.get_corpus_iter(), which iterates over the whole corpus, but no straightforward way to fetch the text of a specific document.

Problem resolved.
The document text can be accessed via index.getMetaIndex().
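
A rough sketch of that lookup, assuming the index was built with the document text stored under a meta key named "text" (the key name depends on how the index was created) and that `index` is the object returned by `pt.IndexFactory.of(...)`. The helper names are hypothetical; `getItem` and `getDocument` are methods of Terrier's MetaIndex.

```python
from typing import Optional


def doc_text_by_docid(index, docid: int, key: str = "text") -> str:
    """Fetch a stored metadata value for an internal Terrier docid."""
    return index.getMetaIndex().getItem(key, docid)


def doc_text_by_docno(index, docno: str, key: str = "text") -> Optional[str]:
    """Resolve docno to docid through the meta index, then fetch the stored value."""
    meta = index.getMetaIndex()
    docid = meta.getDocument("docno", docno)  # negative when the docno is unknown
    if docid < 0:
        return None
    return meta.getItem(key, docid)
```

With a results dataframe `res`, something like `res["docid"].map(lambda d: doc_text_by_docid(index, int(d)))` would attach the text to each row.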