Key error when loading the embedding

HCY123902 opened this issue a year ago

Hello, I encountered an error when loading gtr_wikipedia_index.pkl. The file seems to contain only the Git LFS pointer details:

version https://git-lfs.github.com/spec/v1
oid sha256:211b64c73c2c1fc26371b17d30a31e63017d1ca8155b70a3bc464d59a430decc
size 32279537833

I have tried to fetch the original file using the oid and size. However, I think this is only possible when the pointer is inside a repository.

As constructing the dense embeddings may be too expensive considering the resources that I have, can you kindly take a look at this issue?

Thank you for the help

howard-yen · Answer 1 · Mon Jul 03 2023 12:34:10 GMT+0800 (China Standard Time)

Hi @HCY123902, thank you for finding this bug :)
I have updated the readme to reflect the change -- the download link was incorrect.
Can you try the following command and let me know if it works?
wget https://huggingface.co/datasets/princeton-nlp/gtr-t5-xxl-wikipedia-psgs_w100-index/resolve/main/gtr_wikipedia_index.pkl
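
After downloading, it may be worth confirming that the file is the real index and not another Git LFS pointer before trying to unpickle it. The following is a minimal sketch, not part of the repository: the helper name is_lfs_pointer is hypothetical, the local path is assumed to be the current directory, and the expected size is taken from the pointer quoted above.

```python
import os

# First line of every Git LFS pointer file (as quoted in this thread).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

# Size of the real index in bytes, taken from the "size" field of the pointer.
EXPECTED_SIZE = 32279537833


def is_lfs_pointer(path):
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_HEADER)) == LFS_HEADER


if __name__ == "__main__":
    index_path = "gtr_wikipedia_index.pkl"  # assumed download location
    if not os.path.exists(index_path):
        print(f"{index_path} not found")
    elif is_lfs_pointer(index_path):
        print("This is only a Git LFS pointer; download the file again.")
    elif os.path.getsize(index_path) != EXPECTED_SIZE:
        print("Size differs from the pointer's size field; the download may be incomplete.")
    else:
        print("File looks like the full index.")
```

A pointer file is only a few hundred bytes, so the size check alone would also catch it; the header check just makes the reason explicit.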

HCY123902 · Answer 2 · Tue Jul 04 2023 17:37:18 GMT+0800 (China Standard Time)

Thank you for the help. I think it works now

HCY123902 · Answer 3 · Tue Jul 04 2023 23:27:13 GMT+0800 (China Standard Time)

One more comment, related to sparse retrieval with BM25: the h.raw attribute on line 37 does not seem to exist in h.

When I run retrieval.py --retriever bm25, I get a corresponding error message.

An example dict for a retrieved passage is {'id': '7281f52d-a82e-11eb-8e33-e778e9b30943', 'url': 'https://www.elitedaily.com/news/politics/third-party-candidates-huge-impact/1564095', 'title': 'This Is How Third-Party Candidates Hugely Impact Elections Even When They Lose', 'sha': 'sha1:VSJGRHYGFDS3IH7IMIHEDQWXR5QB3HNY'}, which does not contain the raw key.

Did you take any extra steps to process the sparse index before running the retrieval script? I assume that storing the raw passages would take a huge amount of space, given that the sparse index already takes around 800 GB after compression.

howard-yen · Answer 4 · Wed Jul 05 2023 00:58:25 GMT+0800 (China Standard Time)

Hi @HCY123902, it should be hit.raw instead of h.raw. The hit object from LuceneSearcher has the raw attribute, which contains the passage text. I have updated the code to reflect the change. Thanks for finding the bug :)
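
To illustrate the difference between the two names, here is a rough sketch that is not the actual code in retrieval.py. FakeHit is a hypothetical stand-in for the hit object so the example runs without an index, build_doc is a hypothetical helper, and h is the metadata dict shown earlier in the thread. Whether a real hit exposes raw may depend on the installed Pyserini version, so treat that as an assumption to check.

```python
from dataclasses import dataclass


@dataclass
class FakeHit:
    """Hypothetical stand-in for a search hit; only models the raw attribute."""
    raw: str


def build_doc(hit, h):
    """Combine metadata from the dict h with the passage text from hit.raw."""
    # h is a plain dict with keys such as 'id', 'url', 'title', 'sha';
    # it has no raw entry, so the passage text must come from the hit itself.
    return {"title": h["title"], "url": h["url"], "text": hit.raw}


hit = FakeHit(raw="Example passage text.")
h = {
    "id": "7281f52d-a82e-11eb-8e33-e778e9b30943",
    "url": "https://www.elitedaily.com/news/politics/third-party-candidates-huge-impact/1564095",
    "title": "This Is How Third-Party Candidates Hugely Impact Elections Even When They Lose",
    "sha": "sha1:VSJGRHYGFDS3IH7IMIHEDQWXR5QB3HNY",
}
doc = build_doc(hit, h)
print(doc["text"])
```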