Error when creating index for beir/dbpedia
Jia-py opened this issue
Describe the bug
I got the following error when creating an index for dbpedia:
Traceback (most recent call last):
  File "run.py", line 178, in <module>
    main(args)
  File "run.py", line 30, in main
    indexref = indexer.index(dataset.get_corpus_iter())
  File "/home/work/.local/pyterrier/index.py", line 983, in index
    ParallelIndexer.buildParallel(j_collections, self.index_dir, Indexer, Merger)
  File "jnius/jnius_export_class.pxi", line 877, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 1060, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.util.concurrent.ExecutionException: java.util.NoSuchElementException java.lang.RuntimeException
To Reproduce
Steps to reproduce the behavior:
- Which index - beir/dbpedia
The code I used:
dataset = pt.datasets.get_dataset('irds:beir/dbpedia-entity/test')
indexer = pt.IterDictIndexer('./index/{}'.format(args.dataset.replace('/','-')), meta={'docno':39, args.doc_field:4096}, meta_reverse=['docno','text'])
indexref = indexer.index(dataset.get_corpus_iter())
index = pt.IndexFactory.of(indexref)
Hi Jia,
I wasn't able to reproduce the issue using the PyTerrier sample code from this page: https://ir-datasets.com/beir#beir/dbpedia-entity
>>> import pyterrier as pt
>>> pt.init()
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity')
>>> indexer = pt.IterDictIndexer('./indices/beir_dbpedia-entity', meta={"docno": 200})
>>> index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
...
beir/dbpedia-entity documents: 100%|████| 4635922/4635922 [06:17<00:00, 12293.51it/s]
>>> dataset = pt.get_dataset('irds:beir/dbpedia-entity/dev')
>>> index_ref = pt.IndexRef.of('./indices/beir_dbpedia-entity') # assumes you have already built an index
>>> pipeline = pt.BatchRetrieve(index_ref, wmodel='BM25')
>>> pipeline(dataset.get_topics())
...
qid docid docno rank score query
0 INEX_LD-20120112 2007656 <dbpedia:Terminology_of_the_Vietnam_War> 0 25.399387 vietnam war facts
1 INEX_LD-20120112 2723675 <dbpedia:Leaders_of_the_Vietnam_War> 1 25.087267 vietnam war facts
2 INEX_LD-20120112 1225602 <dbpedia:List_of_songs_about_the_Vietnam_War> 2 25.045166 vietnam war facts
3 INEX_LD-20120112 2325520 <dbpedia:The_Quicksand_War:_Prelude_to_Vietnam> 3 24.620978 vietnam war facts
4 INEX_LD-20120112 1820977 <dbpedia:Legality_of_the_Vietnam_War> 4 24.605000 vietnam war facts
... ... ... ... ... ... ...
65114 TREC_Entity-17 1871347 <dbpedia:The_Hazel_Scott_Show> 995 18.324102 chefs with a show on the food network
65115 TREC_Entity-17 2361894 <dbpedia:RPM_(TV_series)> 996 18.324102 chefs with a show on the food network
65116 TREC_Entity-17 3073627 <dbpedia:Deutschlands_MeisterKoch> 997 18.320805 chefs with a show on the food network
65117 TREC_Entity-17 525996 <dbpedia:Heinz_Winkler_(chef)> 998 18.308566 chefs with a show on the food network
65118 TREC_Entity-17 1961614 <dbpedia:Matthew_Levin_(chef)> 999 18.308566 chefs with a show on the food network
[65119 rows x 6 columns]
Can you provide more details about the indexing setup that caused the error?
Thanks,
sean
We need to see the Java side of the exception. Could you try...
from jnius import JavaException
try:
    index_ref = indexer.index(dataset.get_corpus_iter(), fields=['text', 'title', 'url'])
except JavaException as ja:
    # Print the Java-side stack trace, which the Python traceback hides
    print('\n\t'.join(ja.stacktrace))
    raise
Hi @cmacdonald @seanmacavaney , thanks for your reply. I changed the length of docno from 39 to 200, and it worked.
@Jia-py -- it's usually a good idea to take the PyTerrier samples from https://ir-datasets.com/, especially to handle things like the maximum docno length.
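For anyone hitting the same error on another corpus: the fix above amounts to making the `docno` metadata length at least as large as the longest document identifier. Below is a minimal sketch of how one might measure that before indexing. The helper `max_field_length` and the sample documents are hypothetical and not part of PyTerrier. It counts characters, on the assumption that this is the unit the `meta` length is expressed in.

```python
from typing import Dict, Iterable


def max_field_length(docs: Iterable[Dict[str, str]], field: str = "docno") -> int:
    """Return the length, in characters, of the longest value of `field`.

    Hypothetical helper (not a PyTerrier function); returns 0 for an empty input.
    """
    longest = 0
    for doc in docs:
        longest = max(longest, len(doc[field]))
    return longest


# Made-up stand-in for dataset.get_corpus_iter(): an iterable of dicts
# that each carry a 'docno' key, as in the snippets above.
sample_docs = [
    {"docno": "<dbpedia:Short_Title>", "text": "first toy document"},
    {"docno": "<dbpedia:" + "A" * 50 + ">", "text": "second toy document"},
]

needed = max_field_length(sample_docs)
print(needed)
```

With the real corpus, the same function could be fed `dataset.get_corpus_iter()` and the result (plus some headroom) passed as `meta={'docno': needed}`. This takes one extra pass over all documents, so simply using the value from the ir-datasets sample (200 for this dataset) is the cheaper option.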