seominjoon / denspi

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI)

Home Page:https://nlp.cs.washington.edu/denspi

How to generate dense and sparse vectors for your own data

Arjunsankarlal opened this issue · comments

commented

Hi,
I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!

commented

Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you help me by pointing out where exactly I should look? That would be very helpful. Thanks in advance :)

Hi, is there any update on this?

I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass the generated hdf5 file to this script.

Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids there may no longer correspond to my corpus, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181

I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.

Thank you,
Bhuwan

Hi Bhuwan,

Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which will be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
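
For reference, here is a minimal sketch of what such a paragraph-level tfidf dump could look like. This is illustrative only, not the actual open/dump_tfidf.py code: the scikit-learn vectorizer, the dump_paragraph_tfidf name, and the exact hdf5 layout are assumptions; only the [PAR] splitting and the args.dump_dir/tfidf output location come from the description above.

```python
# Illustrative sketch only -- NOT the actual open/dump_tfidf.py code.
# The vectorizer choice (scikit-learn), the function name, and the hdf5
# layout are assumptions; only the [PAR] splitting and the
# args.dump_dir/tfidf output location come from the discussion above.
import os
import h5py
from sklearn.feature_extraction.text import TfidfVectorizer

def dump_paragraph_tfidf(docs, dump_dir):
    """docs: list of document strings that use [PAR] as a paragraph separator."""
    paragraphs = []
    for doc in docs:
        paragraphs.extend(p.strip() for p in doc.split("[PAR]") if p.strip())

    # Fit tf-idf over paragraphs; the result is a sparse CSR matrix with one
    # row per paragraph.
    tfidf = TfidfVectorizer().fit_transform(paragraphs)

    out_dir = os.path.join(dump_dir, "tfidf")
    os.makedirs(out_dir, exist_ok=True)
    with h5py.File(os.path.join(out_dir, "tfidf.hdf5"), "w") as f:
        # Store the CSR components so the matrix can be rebuilt at query time.
        f.create_dataset("data", data=tfidf.data)
        f.create_dataset("indices", data=tfidf.indices)
        f.create_dataset("indptr", data=tfidf.indptr)
        f.attrs["shape"] = tfidf.shape

dump_paragraph_tfidf(["First paragraph. [PAR] Second paragraph."], "dump")
```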

Also, the reason we need DrQA is to compute document-level tfidf, as it has the inverted index of the whole Wikipedia corpus. If you want to use a subset of Wikipedia for running DenSPI, you have to modify the code to map your documents to their original indices in the DrQA Wikipedia corpus. And yes, it will create a problem if you use a custom (non-Wikipedia) corpus in this version. You can simply remove the document-level tfidf, but that will give you a noticeable decrease in performance (especially for QA pairs where document selection matters, e.g., SQuAD-open). For custom document-level tfidf generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py
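
To make the id mismatch concrete, here is a toy stand-in (not DrQA's real TfidfDocRanker; the class and data are made up) showing why document scores break on a custom corpus: rows of doc_mat are addressed through a title-to-index mapping built from Wikipedia, so any document outside that mapping simply cannot be looked up.

```python
# Toy stand-in for DrQA's document ranker -- not the real implementation.
# It only illustrates the failure mode: doc_mat rows are addressed through a
# mapping built from Wikipedia titles, so custom documents are simply absent.
import numpy as np

class ToyRanker:
    def __init__(self, titles, doc_mat):
        # title -> row index of doc_mat, analogous to DrQA's internal mapping
        self.doc_dict = {title: i for i, title in enumerate(titles)}
        self.doc_mat = doc_mat

    def get_doc_index(self, title):
        return self.doc_dict[title]  # KeyError for unknown (custom) documents

ranker = ToyRanker(["Seattle", "Tokyo"], np.eye(2))
print(ranker.get_doc_index("Seattle"))   # 0 -> row 0 of doc_mat
# ranker.get_doc_index("my_custom_doc")  # raises KeyError: not in Wikipedia
```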

We are working on refactoring and providing cleaner code for custom corpora. It should take a few more weeks. Thanks.

Jinhyuk

Thanks for the quick response, Jinhyuk!

So, to confirm that my understanding is correct: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (Since doc_idx seems to be inferred using an enumerate over the input docs here.)

Yes, you are correct. See here, where doc_idx is used as the key of the hdf5 files, and here, where doc_idx is used to get document scores calculated from self.ranker.doc_mat.
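
In other words, the contract looks roughly like the sketch below (file and variable names are hypothetical; only the enumerate-derived doc_idx and the hdf5 keying follow the code linked above):

```python
# Rough sketch of the ordering contract -- names and shapes are hypothetical.
# doc_idx is derived by enumerating the input documents, used as the hdf5
# group key for the phrase dump, and must therefore match the row order of
# self.ranker.doc_mat built from the same document list.
import h5py
import numpy as np

docs = ["text of document 0", "text of document 1"]  # same order everywhere

with h5py.File("phrase_dump.hdf5", "w") as f:
    for doc_idx, doc in enumerate(docs):
        group = f.create_group(str(doc_idx))  # keyed by doc_idx
        # Placeholder for the dense phrase vectors of this document:
        group.create_dataset("phrase_vecs", data=np.zeros((4, 16)))
```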

Hi @Arjunsankarlal and @bdhingra ,
I just updated the code and readme so that they now support running a demo with a custom phrase index.
Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index
You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.

Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index

It's still missing some details, which will be added soon. Thanks!