Key error when loading the embedding
HCY123902 opened this issue · comments
Hello, I encountered an error when loading `gtr_wikipedia_index.pkl`: the downloaded file only contains the Git LFS pointer details:

```
version https://git-lfs.github.com/spec/v1
oid sha256:211b64c73c2c1fc26371b17d30a31e63017d1ca8155b70a3bc464d59a430decc
size 32279537833
```
I have tried fetching the original archive using the `oid` and `size` from the pointer, but I believe that only works when the pointer file is inside a Git repository.
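For anyone hitting the same symptom, here is a quick sketch (the helper function is mine, not from the repo) to check whether a download is the real payload or just an LFS pointer:

```shell
# Sketch: a Git LFS pointer file is only ~130 bytes and starts with the
# LFS spec line, while the real index is ~32 GB.
is_lfs_pointer() {
  head -c 100 "$1" 2>/dev/null | grep -q "git-lfs.github.com/spec"
}

# Demo with a pointer file created on the spot (contents mirror the issue).
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:abc\nsize 1\n' > demo_pointer.pkl
if is_lfs_pointer demo_pointer.pkl; then
  echo "pointer only -- re-download the actual file"
fi
```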
Since constructing the dense embeddings myself may be too expensive given the resources I have, could you kindly take a look at this issue? Thank you for the help.
Hi @HCY123902, thank you for finding this bug :)
I have updated the readme to reflect the change -- the download link was incorrect.
Can you try the following command and let me know if it works?
```shell
wget https://huggingface.co/datasets/princeton-nlp/gtr-t5-xxl-wikipedia-psgs_w100-index/resolve/main/gtr_wikipedia_index.pkl
```
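As an optional sanity check after the download (assuming `sha256sum` is available; the expected value is the `oid` from the LFS pointer quoted above):

```shell
# Compare the download's SHA-256 against the oid recorded in the LFS pointer.
expected=211b64c73c2c1fc26371b17d30a31e63017d1ca8155b70a3bc464d59a430decc
actual=$(sha256sum gtr_wikipedia_index.pkl 2>/dev/null | cut -d' ' -f1)
if [ "$actual" = "$expected" ]; then
  echo "checksum OK"
else
  echo "checksum mismatch or file missing"
fi
```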
Thank you for the help. I think it works now
Just another comment related to sparse retrieval with BM25: the `h.raw` attribute on line 37 does not seem to exist on `h`. When I run `retrieval.py --retriever bm25`, there is a corresponding error message. An example dict for a retrieved passage is

```
{'id': '7281f52d-a82e-11eb-8e33-e778e9b30943', 'url': 'https://www.elitedaily.com/news/politics/third-party-candidates-huge-impact/1564095', 'title': 'This Is How Third-Party Candidates Hugely Impact Elections Even When They Lose', 'sha': 'sha1:VSJGRHYGFDS3IH7IMIHEDQWXR5QB3HNY'}
```

which does not contain the `raw` key.
Did you take any extra steps to process the sparse index before running the retrieval script? I assume that storing the raw passages would take a huge amount of space, given that the sparse index already takes around 800 GB after compression.
Hi @HCY123902, it should be `hit.raw` instead of `h.raw`. The hit object from `LuceneSearcher` has the `raw` attribute, which contains the passage text. I have updated the code to reflect the change. Thanks for finding the bug :)