castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Home Page:http://pyserini.io/

How to get raw content

BeastyZ opened this issue

Hi, I'm trying to print the raw content of beir-v1.0.0-scifact.flat, but it fails. The same approach works for me on msmarco-v1-passage, as in HyDE. Below is my code:

from pyserini.search import FaissSearcher, LuceneSearcher
corpus = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-scifact.flat')
print(corpus.doc("0").raw())

Then, I got the error:
AttributeError: 'NoneType' object has no attribute 'raw'

I've found the solution

Hey, what was your solution? Did you index using Faiss or Lucene? I'm confused: since we can't get raw content from a Faiss index, I guess a Lucene index is required no matter whether we search with Faiss or Lucene?

I think it has nothing to do with Faiss or Lucene. Whether the index is prebuilt or self-built, you have to use a docid that actually exists in the index to get the raw content; doc() returns None for an unknown docid, which is what causes the AttributeError. If you build the index yourself and want to get the raw content, maybe you can refer to usage-index.
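For anyone landing here later, a minimal sketch of what that looks like. The helper name get_raw and the demo function are mine, not part of Pyserini. It assumes that LuceneSearcher.doc() treats a str as the collection docid and an int as the internal Lucene docid, and that the prebuilt index stores raw documents:

```python
from typing import Optional, Union


def get_raw(searcher, docid: Union[str, int]) -> Optional[str]:
    """Return the raw content for docid, or None if the docid is not in the index.

    `searcher` is anything with a Pyserini-style doc() method (hypothetical helper).
    """
    doc = searcher.doc(docid)  # None when the docid does not exist
    if doc is None:
        return None
    return doc.raw()


def demo() -> None:
    # Imported lazily so the helper above can be used without Pyserini installed.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index('beir-v1.0.0-scifact.flat')

    # An int is read as an internal Lucene docid, so 0 is the first indexed document.
    first = searcher.doc(0)
    print(first.docid())                      # the collection docid to use from now on
    print(get_raw(searcher, first.docid()))   # raw content of that document

    # A str is read as a collection docid; "0" is not one in this index,
    # so this prints None instead of raising AttributeError.
    print(get_raw(searcher, "0"))
```

Calling demo() needs Pyserini, Java, and a download of the prebuilt index. Docids returned by searcher.search(...) hits can be passed to get_raw the same way.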

Got it, thanks!