luyug / Condenser

EMNLP 2021 - Pre-training architectures for dense retrieval

Transfer to the MS MARCO document dataset

Berlin-98 opened this issue

Hi,
I am using this repo to run experiments on the MS MARCO document dataset, but I am a little confused about the differences between the Condenser, Tevatron, and coCondenser repos. I am following the "coCondenser MS-MARCO Passage Retrieval" guide and trying to adapt it to the MS MARCO document dataset, starting from a Condenser checkpoint.

If I only want to reproduce the results of the coCondenser paper, I think I just need to encode and then run the index search. Is that right?

If I want to switch to the MS MARCO document data and a Condenser checkpoint, do I need to follow fine-tuning stages one and two? That is: first fine-tune a checkpoint and save it to retriever_model_s1/, then use that checkpoint to mine hard negatives, then fine-tune the model further on those hard negatives and save it to retriever_model_s2, and finally run the search on the dev set. Is that right?
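To make the hard-negative mining step in the question concrete, here is a toy sketch of the idea as I understand it: score every document against each query by inner product, take the top-ranked documents, and keep the ones that are not labeled relevant. This is only an illustration with NumPy on made-up 2-d vectors. It is not the Tevatron or coCondenser implementation, and the function name `mine_hard_negatives`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def mine_hard_negatives(query_emb, doc_emb, positives, depth=2):
    """For each query, return the top-`depth` retrieved doc ids that are
    not labeled as positives (hypothetical helper, for illustration only)."""
    # Inner-product similarity between every query and every document.
    scores = query_emb @ doc_emb.T
    # Doc ids sorted by descending score, truncated to the mining depth.
    ranked = np.argsort(-scores, axis=1)[:, :depth]
    # Drop the labeled positives; what remains are the hard negatives.
    return [
        [int(d) for d in row if int(d) not in positives[q]]
        for q, row in enumerate(ranked)
    ]


# Toy embeddings standing in for the output of the stage-one retriever.
doc_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
# Labeled relevant docs per query index (assumed format).
positives = {0: {0}, 1: {2}}

hard_negatives = mine_hard_negatives(query_emb, doc_emb, positives, depth=2)
print(hard_negatives)
```

In the workflow described above, the embeddings would come from the model saved in retriever_model_s1/, and the mined negatives would feed the second fine-tuning stage.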