castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini

Home Page: http://pygaggle.ai/

How to obtain the original dev subset as a TSV file?

XY2323819551 opened this issue

Hi, I am running the "Experiments on MS MARCO Passage Retrieval - Dev Subset - with GPU". I want to get the original dev subset as a TSV file (containing 105 queries), in the same form as the "top1000.dev.tsv" and other required files that "Passage Re-ranking with BERT" provides. There I can use the conversion scripts to convert a TSV file to TFRecord format, but the full file is too big. I only want to convert the 105 queries, not almost 6800 queries. How can I get that dev subset?

commented

The data prep section has details on how to get this subset; you download it as part of that step. It should be in the msmarco_ans_small folder after extracting. To add some clarity: the dev subset we use is nothing official. It is just a randomly curated subset of MS MARCO Passage that most users can run these systems on quickly, instead of running on all the 6xxx queries.
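If the goal is a TSV in the same layout as top1000.dev.tsv but restricted to the subset, one possible approach is to filter the full file by the subset's query ids. The following is only a minimal sketch, not something taken from the PyGaggle scripts. It assumes that both files are tab-separated with the query id in the first column (top1000.dev.tsv uses qid, pid, query, passage). The function names and the file names in the usage note are hypothetical placeholders.

```python
def load_query_ids(queries_path):
    """Collect query ids from the first column of a tab-separated file.

    Assumption: each non-empty line starts with the query id, followed by a tab.
    """
    with open(queries_path, encoding="utf-8") as f:
        return {line.split("\t", 1)[0].strip() for line in f if line.strip()}


def filter_by_query_ids(full_tsv_path, query_ids, out_path):
    """Copy only the rows whose first column is in query_ids.

    Rows are streamed one at a time, so the large input file is never
    loaded into memory. Returns the number of rows written.
    """
    kept = 0
    with open(full_tsv_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            qid = line.split("\t", 1)[0].strip()
            if qid in query_ids:
                dst.write(line)
                kept += 1
    return kept
```

For example, `filter_by_query_ids("top1000.dev.tsv", load_query_ids("subset_queries.tsv"), "top1000.dev.subset.tsv")` would write the reduced file, where subset_queries.tsv stands in for whichever queries file is found in the extracted msmarco_ans_small folder. The reduced TSV can then be passed to the TFRecord conversion script in place of the full file.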

OK, thanks for your reply!