seominjoon / denspi

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI)

Home Page:https://nlp.cs.washington.edu/denspi

How to generate dense and sparse vectors for your own data

Arjunsankarlal opened this issue · comments

commented

Hi,
I believe you mean creating your own index for an arbitrary text corpus. The code is there but lacks documentation/refactoring. Working on it, please stay tuned!

commented

Hi @seominjoon, thanks for the response. Yes, that is exactly what I am looking for. Could you help me by pointing out where exactly I should look? That would be very helpful. Thanks in advance :)

Hi, is there any update on this?

I was trying to generate the sparse index for my own corpus. I assumed open/dump_tfidf.py is the script needed to do this. I am also assuming that we need to pass --sparse to open/run_pred.py to use the sparse index. But I am not sure which argument to use to pass the generated hdf5 file to this script.

Also, what confused me is that open/run_pred.py still seems to require the Wikipedia tfidf dump from DrQA (as --ranker_path). What is this used for? The doc ids there may no longer correspond to my corpus, so will that create a problem? E.g. here: https://github.com/uwnlp/denspi/blob/master/open/mips_sparse.py#L181

I would greatly appreciate some guidance on how to run the dense + sparse index for a custom corpus.

Thank you,
Bhuwan

Hi Bhuwan,

Sorry for the inconvenience. Running open/dump_tfidf.py outputs paragraph-level tfidf for your corpus, which will be located under the args.dump_dir/tfidf folder. Note that this script uses [PAR] to split a document into paragraphs.
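
For reference, here is a minimal sketch of what such a paragraph-level tfidf dump could look like. This is illustrative only, not the actual open/dump_tfidf.py code: the scikit-learn vectorizer, the dump_paragraph_tfidf name, and the exact hdf5 layout are assumptions; only the [PAR] splitting and the args.dump_dir/tfidf output location come from the description above.

```python
# Illustrative sketch only -- NOT the actual open/dump_tfidf.py code.
# The vectorizer choice (scikit-learn), the function name, and the hdf5
# layout are assumptions; only the [PAR] splitting and the
# args.dump_dir/tfidf output location come from the discussion above.
import os
import h5py
from sklearn.feature_extraction.text import TfidfVectorizer

def dump_paragraph_tfidf(docs, dump_dir):
    """docs: list of document strings that use [PAR] as a paragraph separator."""
    paragraphs = []
    for doc in docs:
        paragraphs.extend(p.strip() for p in doc.split("[PAR]") if p.strip())

    # Fit tf-idf over paragraphs; the result is a sparse CSR matrix with one
    # row per paragraph.
    tfidf = TfidfVectorizer().fit_transform(paragraphs)

    out_dir = os.path.join(dump_dir, "tfidf")
    os.makedirs(out_dir, exist_ok=True)
    with h5py.File(os.path.join(out_dir, "tfidf.hdf5"), "w") as f:
        # Store the CSR components so the matrix can be rebuilt at query time.
        f.create_dataset("data", data=tfidf.data)
        f.create_dataset("indices", data=tfidf.indices)
        f.create_dataset("indptr", data=tfidf.indptr)
        f.attrs["shape"] = tfidf.shape

dump_paragraph_tfidf(["First paragraph. [PAR] Second paragraph."], "dump")
```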

Also, the reason we need DrQA is to compute document-level tfidf, as it has the inverted index of the whole Wikipedia corpus. If you want to use a subset of Wikipedia for running DenSPI, you have to modify the code to map your documents to their original indices in the DrQA Wikipedia corpus. And yes, it will create a problem if you use a custom (non-Wikipedia) corpus in this version. You can simply remove the document-level tfidf, but that will give you a noticeable decrease in performance (especially for QA pairs where document selection matters, e.g., SQuAD-open). For custom document-level tfidf generation, see here: https://github.com/facebookresearch/DrQA/blob/master/scripts/retriever/build_tfidf.py
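
To make the id mismatch concrete, here is a toy stand-in (not DrQA's real TfidfDocRanker; the class and data are made up) showing why document scores break on a custom corpus: rows of doc_mat are addressed through a title-to-index mapping built from Wikipedia, so any document outside that mapping simply cannot be looked up.

```python
# Toy stand-in for DrQA's document ranker -- not the real implementation.
# It only illustrates the failure mode: doc_mat rows are addressed through a
# mapping built from Wikipedia titles, so custom documents are simply absent.
import numpy as np

class ToyRanker:
    def __init__(self, titles, doc_mat):
        # title -> row index of doc_mat, analogous to DrQA's internal mapping
        self.doc_dict = {title: i for i, title in enumerate(titles)}
        self.doc_mat = doc_mat

    def get_doc_index(self, title):
        return self.doc_dict[title]  # KeyError for unknown (custom) documents

ranker = ToyRanker(["Seattle", "Tokyo"], np.eye(2))
print(ranker.get_doc_index("Seattle"))   # 0 -> row 0 of doc_mat
# ranker.get_doc_index("my_custom_doc")  # raises KeyError: not in Wikipedia
```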

We are working on refactoring and providing cleaner code for custom corpora. It should take a few more weeks. Thanks.

Jinhyuk

Thanks for the quick response, Jinhyuk!

So, to confirm that my understanding is correct: the order of documents in self.ranker.doc_mat here should match the order in the predict file used for generating the phrase vectors passed to run_piqa.py? (Since doc_idx seems to be inferred using an enumerate over the input docs here.)

Yes, you are correct. See here, where doc_idx is used as the key of the hdf5 files, and here, where doc_idx is used to get document scores calculated from self.ranker.doc_mat.
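
In other words, the contract looks roughly like the sketch below (file and variable names are hypothetical; only the enumerate-derived doc_idx and the hdf5 keying follow the code linked above):

```python
# Rough sketch of the ordering contract -- names and shapes are hypothetical.
# doc_idx is derived by enumerating the input documents, used as the hdf5
# group key for the phrase dump, and must therefore match the row order of
# self.ranker.doc_mat built from the same document list.
import h5py
import numpy as np

docs = ["text of document 0", "text of document 1"]  # same order everywhere

with h5py.File("phrase_dump.hdf5", "w") as f:
    for doc_idx, doc in enumerate(docs):
        group = f.create_group(str(doc_idx))  # keyed by doc_idx
        # Placeholder for the dense phrase vectors of this document:
        group.create_dataset("phrase_vecs", data=np.zeros((4, 16)))
```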

Hi @Arjunsankarlal and @bdhingra ,
I just updated the code and readme so that they now support running a demo with a custom phrase index.
Please try https://github.com/uwnlp/denspi#train and https://github.com/uwnlp/denspi#create-a-custom-phrase-index
You will be able to train with your own SQuAD-like data and host a demo with your custom document files as well.

Scaling up is detailed in https://github.com/uwnlp/denspi#create-a-large-phrase-index

It's still missing some details, which will be added soon. Thanks!