terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home Page: https://pyterrier.readthedocs.io/

Error when creating index for beir/dbpedia

Jia-py opened this issue

Describe the bug
Got the following error when creating index for dbpedia

Traceback (most recent call last):
  File "run.py", line 178, in <module>
  File "run.py", line 30, in main
    indexref = indexer.index(dataset.get_corpus_iter())
  File "/home/work/.local/pyterrier/index.py", line 983, in index
    ParallelIndexer.buildParallel(j_collections, self.index_dir, Indexer, Merger)
  File "jnius/jnius_export_class.pxi", line 877, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1060, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.util.concurrent.ExecutionException: java.util.NoSuchElementException java.lang.RuntimeException

To Reproduce
Steps to reproduce the behavior:

  1. Which index - beir/dbpedia

The code I used:

dataset = pt.datasets.get_dataset('irds:beir/dbpedia-entity/test')
indexer = pt.IterDictIndexer('./index/{}'.format(args.dataset.replace('/','-')), meta={'docno':39, args.doc_field:4096}, meta_reverse=['docno','text'])
indexref = indexer.index(dataset.get_corpus_iter())
index = pt.IndexFactory.of(indexref)

Hi Jia,

I wasn't able to reproduce the issue using the PyTerrier sample code from this page: https://ir-datasets.com/beir#beir/dbpedia-entity

>>> import pyterrier as pt
>>> pt.init()
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity')
>>> indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
>>> index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
beir/dbpedia-entity documents: 100%|████| 4635922/4635922 [06:17<00:00, 12293.51it/s]
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
>>> index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
>>> pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
>>> pipeline(dataset.get_topics())
                    qid    docid                                            docno  rank      score                                  query
0      INEX_LD-20120112  2007656         <dbpedia:Terminology_of_the_Vietnam_War>     0  25.399387                      vietnam war facts
1      INEX_LD-20120112  2723675             <dbpedia:Leaders_of_the_Vietnam_War>     1  25.087267                      vietnam war facts
2      INEX_LD-20120112  1225602    <dbpedia:List_of_songs_about_the_Vietnam_War>     2  25.045166                      vietnam war facts
3      INEX_LD-20120112  2325520  <dbpedia:The_Quicksand_War:_Prelude_to_Vietnam>     3  24.620978                      vietnam war facts
4      INEX_LD-20120112  1820977            <dbpedia:Legality_of_the_Vietnam_War>     4  24.605000                      vietnam war facts
...                 ...      ...                                              ...   ...        ...                                    ...
65114    TREC_Entity-17  1871347                   <dbpedia:The_Hazel_Scott_Show>   995  18.324102  chefs with a show on the food network
65115    TREC_Entity-17  2361894                        <dbpedia:RPM_(TV_series)>   996  18.324102  chefs with a show on the food network
65116    TREC_Entity-17  3073627               <dbpedia:Deutschlands_MeisterKoch>   997  18.320805  chefs with a show on the food network
65117    TREC_Entity-17   525996                   <dbpedia:Heinz_Winkler_(chef)>   998  18.308566  chefs with a show on the food network
65118    TREC_Entity-17  1961614                   <dbpedia:Matthew_Levin_(chef)>   999  18.308566  chefs with a show on the food network

[65119 rows x 6 columns]

Can you provide more details about the indexing setup that caused the error?


We need to see the Java side of the exception. Could you try...

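from jnius import JavaException

try:
  index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
except JavaException as ja:
  # print the Java-side stack trace, which the Python traceback does not show
  print('\n\t'.join(ja.stacktrace))
  raise ja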

Hi @cmacdonald @seanmacavaney, thanks for your reply. I changed the docno length in meta from 39 to 200, and indexing worked.

@Jia-py -- it's usually a good idea to start from the PyTerrier samples at https://ir-datasets.com/, especially to handle things like the maximum docno length.
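
If you are unsure what length to pass in meta, one option is to measure the longest docno before indexing. Below is a minimal sketch in plain Python. The helper name max_field_length is hypothetical (it is not part of PyTerrier), and sample_docs is a toy stand-in for dataset.get_corpus_iter(), using two docnos copied from the results earlier in this thread. It only measures lengths in characters; it does not confirm what caused the original exception.

```python
from typing import Dict, Iterable


def max_field_length(docs: Iterable[Dict[str, str]], field: str = "docno") -> int:
    """Return the length, in characters, of the longest value of `field`.

    Hypothetical helper for illustration; returns 0 for an empty corpus.
    """
    return max((len(doc[field]) for doc in docs), default=0)


# Toy stand-in for dataset.get_corpus_iter(); docnos taken from the output above.
sample_docs = [
    {"docno": "<dbpedia:RPM_(TV_series)>", "text": "..."},
    {"docno": "<dbpedia:The_Quicksand_War:_Prelude_to_Vietnam>", "text": "..."},
]

longest = max_field_length(sample_docs)
print("longest docno:", longest)

# Compare against the limit used in the failing setup (39).
if longest > 39:
    print("a docno meta length of 39 is too short for this sample")
```

On the real corpus you would pass dataset.get_corpus_iter() instead of sample_docs. That is a full pass over about 4.6 million documents, so it is only worth doing when you cannot reuse a known-good value such as the 200 from the ir-datasets sample.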