princeton-nlp / ALCE

[EMNLP 2023] Enabling Large Language Models to Generate Text with Citations. Paper: https://arxiv.org/abs/2305.14627

Key error when loading the embedding

HCY123902 opened this issue

Hello, I encountered an error when loading gtr_wikipedia_index.pkl. The file seems to contain only the Git LFS pointer details rather than the index itself:

version https://git-lfs.github.com/spec/v1
oid sha256:211b64c73c2c1fc26371b17d30a31e63017d1ca8155b70a3bc464d59a430decc
size 32279537833

I tried to fetch the original archive using the oid and size above. However, I think this is only possible when the pointer file sits inside a Git repository.
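For anyone hitting the same symptom, a quick way to confirm that a downloaded file is a Git LFS pointer rather than the real pickle is to inspect its first bytes. The sketch below is only an illustration: the helper name read_lfs_pointer is made up here, and it relies on the pointer layout quoted above (a version line followed by "key value" lines) plus the LFS rule that pointer files are under 1024 bytes.

```python
from pathlib import Path

# First line of every Git LFS pointer file (as quoted in the issue above).
LFS_SPEC_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def read_lfs_pointer(path):
    """Return {'oid': ..., 'size': ...} if `path` is an LFS pointer, else None.

    Hypothetical helper for illustration; not part of the ALCE code base.
    """
    p = Path(path)
    # Pointer files are tiny text stubs; a large file is real data.
    if p.stat().st_size >= 1024:
        return None
    data = p.read_bytes()
    if not data.startswith(LFS_SPEC_PREFIX):
        return None
    fields = {}
    for line in data.decode("utf-8").splitlines():
        # Each line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    return {"oid": fields.get("oid", ""), "size": int(fields.get("size", "0"))}
```

If this returns a dict for gtr_wikipedia_index.pkl, the file on disk is only the stub and the actual index still has to be downloaded.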

As constructing the dense embeddings may be too expensive considering the resources that I have, can you kindly take a look at this issue?

Thank you for the help

Hi @HCY123902, thank you for finding this bug :)
I have updated the readme to reflect the change -- the download link was incorrect.
Can you try the following command and let me know if it works?
wget https://huggingface.co/datasets/princeton-nlp/gtr-t5-xxl-wikipedia-psgs_w100-index/resolve/main/gtr_wikipedia_index.pkl
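After the download finishes, one way to sanity-check the result is to compare the file's size, and optionally its SHA-256, against known values. This is a rough sketch under assumptions: the function name matches_expected is invented for illustration, and the thread does not confirm that the Hugging Face file has the same size or hash as the LFS pointer quoted earlier, so treat a mismatch against those values as a hint rather than proof of corruption.

```python
import hashlib
import os


def matches_expected(path, expected_size, expected_sha256=None, chunk_size=1 << 20):
    """Check a file's byte size and, optionally, its SHA-256 hex digest.

    Hypothetical helper for illustration; not part of the ALCE code base.
    """
    # Cheap check first: a truncated download fails here without hashing.
    if os.path.getsize(path) != expected_size:
        return False
    if expected_sha256 is None:
        return True
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a multi-gigabyte index never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Hashing a file of roughly 32 GB takes a while, so passing only expected_size is a reasonable first pass.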

Thank you for the help. I think it works now.

One more comment, related to sparse retrieval with BM25: the h.raw attribute used on line 37 does not seem to exist in h.

When I run retrieval.py --retriever bm25, I get a corresponding error message.

An example dict for a retrieved passage (hit) is {'id': '7281f52d-a82e-11eb-8e33-e778e9b30943', 'url': 'https://www.elitedaily.com/news/politics/third-party-candidates-huge-impact/1564095', 'title': 'This Is How Third-Party Candidates Hugely Impact Elections Even When They Lose', 'sha': 'sha1:VSJGRHYGFDS3IH7IMIHEDQWXR5QB3HNY'}, which does not contain the raw key.

Did you take any extra steps to process the sparse index before running the retrieval script? I assume that storing the raw passages would take up a huge amount of space, given that the sparse index already takes around 800 GB after compression.

Hi @HCY123902, it should be hit.raw instead of h.raw. The hit object returned by LuceneSearcher has a raw attribute, which contains the passage text. I have updated the code to reflect the change. Thanks for finding the bug :)
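To make the distinction concrete, here is a minimal sketch of combining the two objects: the metadata dict h (with id, url, title, sha, as in the example above) and the search hit whose raw attribute holds the passage text. The function name build_doc and the output keys are assumptions made for illustration; the actual retrieval.py may structure this differently.

```python
def build_doc(hit, meta):
    """Merge passage metadata with the passage text stored on the search hit.

    `hit`  : a search result object exposing a `raw` attribute (per the reply above).
    `meta` : the metadata dict for that hit (the `h` in the thread).
    Hypothetical helper; the output keys are illustrative, not the ALCE schema.
    """
    # The text lives on the hit object, not in the metadata dict,
    # so meta["raw"] / h.raw fails while hit.raw works.
    text = getattr(hit, "raw", None)
    if text is None:
        raise ValueError("hit has no raw text; the index may not store raw passages")
    return {"id": meta.get("id"), "title": meta["title"], "url": meta["url"], "text": text}
```

Raising an explicit error when raw is missing makes it easier to tell an index built without raw passage storage apart from a simple typo in the attribute lookup.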