RoDR

This repository contains the code and resources for our paper:

Xuanang Chen, Jian Luo, Ben He, Le Sun, Yingfei Sun. Towards Robust Dense Retrieval via Local Ranking Alignment. In IJCAI 2022.

Installation

Our code is developed on top of the Tevatron DR training toolkit. We recommend creating a new conda environment (conda create -n rodr python=3.7), activating it (conda activate rodr), and then installing the following packages: torch==1.8.1, faiss-cpu==1.7.1, transformers==4.9.2, datasets==1.11.0.

Query Variations

Note: In this repo, we mainly use the MS MARCO passage ranking dataset as an example. Before running the experiments, you can refer to the download_raw_data.sh script to download and process the raw data, which will be saved in the data/msmarco_passage/raw folder. For example, the train.negatives.tsv file contains the negatives of each train query and is used to construct the training data.

Dev Query: All query variation sets for the MS MARCO small Dev set used in our paper are provided in the data/msmarco_passage/query/dev folder. You can use these sets directly to test the robustness of your DR model, or generate a query variation set yourself with the query_variation_generation.py script:

qv_type=MisSpell
python query_variation_generation.py \
  --original_query_file ./msmarco_passage/raw/queries.dev.small.tsv \
  --query_variation_file ./msmarco_passage/process/query/dev/queries.dev.small.${qv_type}.tsv \
  --variation_type ${qv_type}

You need to specify the type of query variation (qv_type) as one of eight pre-defined types: MisSpell, ExtraPunc, BackTrans, SwapSyn_Glove, SwapSyn_WNet, TransTense, NoStopword, SwapWords. Note that a few queries may remain unchanged in a given query variation set. For example, if a query does not contain any stopword, the NoStopword variation is not applicable. Also, before using the query_variation_generation.py script, you may need to install the TextFlint, TextAttack, and NLTK toolkits.
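
To make the "not applicable" case concrete, here is a minimal sketch of a NoStopword-style variation over qid<TAB>query lines. It is not the code in query_variation_generation.py: the tiny STOPWORDS set and the function names no_stopword_variation and vary_tsv_lines are hypothetical and chosen only for illustration.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "what", "how", "do", "does"}


def no_stopword_variation(query: str) -> str:
    """Drop stopwords; return the query unchanged when the variation does not apply."""
    tokens = query.split()
    kept = [t for t in tokens if t.lower() not in STOPWORDS]
    # Not applicable: nothing was removed, or removal would leave an empty query.
    if not kept or len(kept) == len(tokens):
        return query
    return " ".join(kept)


def vary_tsv_lines(lines):
    """Map 'qid<TAB>query' lines to 'qid<TAB>variation' lines."""
    out = []
    for line in lines:
        qid, query = line.rstrip("\n").split("\t", 1)
        out.append(f"{qid}\t{no_stopword_variation(query)}")
    return out
```

A query such as "define photosynthesis" contains no word from the list, so it is written to the variation file unchanged.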

Train Query: We also need to generate variations for the train queries to enhance the DR model. As with the Dev set, we first generate eight variation sets for the train query set, and then merge them uniformly to obtain the final train query variation set (our generated train query variation file is available in the data/msmarco_passage/query/train folder). This set is used to insert variations into the training data by adding a 'query_variation' field to each training example. You can refer to the construct_train_query_variations.py script once you have the train variation sets and the original training data.
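
The merge-and-insert step can be sketched as follows. This is one plausible reading of "merge uniformly" (pick one variation type per query uniformly at random), not the logic of construct_train_query_variations.py. The function names and the 'query_id' and 'query' field names are assumptions; only the 'query_variation' field comes from the description above.

```python
import random


def merge_variation_sets(variation_sets, seed=0):
    """variation_sets: {variation_type: {qid: varied_query}} -> {qid: varied_query}.

    For each query, one variation type is drawn uniformly at random
    among the types that provide a variation for that query.
    """
    rng = random.Random(seed)
    types = sorted(variation_sets)
    qids = sorted(set().union(*(variation_sets[t].keys() for t in types)))
    merged = {}
    for qid in qids:
        available = [t for t in types if qid in variation_sets[t]]
        merged[qid] = variation_sets[rng.choice(available)][qid]
    return merged


def add_query_variation(examples, merged):
    """Return copies of the examples with a 'query_variation' field attached.

    Falls back to the original query when no variation is available.
    """
    return [
        {**ex, "query_variation": merged.get(ex["query_id"], ex["query"])}
        for ex in examples
    ]
```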

Training

Standard DR: To obtain a standard DR model, such as DR_OQ in our paper, you first need to construct the training data:

- OQ: the training data with the original train queries, generated by the bulid_train.py script.
- QV: the training data with train query variations, obtained by inserting the variation version of each original train query into the OQ training data.

After that, you can refer to the train_standard_dpr.sh script to train the DR_OQ, DR_QV, and DR_OQ->QV models using the OQ and QV training data, as described in our paper.

RoDR: For our proposed RoDR model, you need to collect nearer neighbors for the queries to achieve better alignment. Specifically, you can update the negatives in the OQ training data by sampling from the top candidates returned by the DR_OQ model. You can then refer to the bulid_train_nn.py script, whose --query_variation argument takes the generated train query variation file. Alternatively, you can add the variation version of the train queries after constructing the training data, similar to QV, using the construct_training_data_with_variations function in the construct_train_query_variations.py script.
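
As a rough illustration of the negative-update step, the sketch below samples negatives from the top of a ranked candidate list while excluding known positives. The function name and the default values of top_k and num_neg are hypothetical; they are not taken from the paper or from bulid_train_nn.py.

```python
import random


def sample_nearest_negatives(ranked_pids, positive_pids, top_k=200, num_neg=30, seed=0):
    """Sample negatives from the top_k candidates returned by a DR model.

    ranked_pids: passage ids sorted by descending retrieval score.
    positive_pids: ids of passages labeled relevant; these are never sampled.
    """
    rng = random.Random(seed)
    positives = set(positive_pids)
    pool = [pid for pid in ranked_pids[:top_k] if pid not in positives]
    # Sample without replacement; return fewer if the pool is too small.
    return rng.sample(pool, min(num_neg, len(pool)))
```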

After that, you can refer to the train_rodr_dpr.sh script to train a RoDR w/ DR_OQ model on top of the DR_OQ model. Compared to standard DR training, you need to set --training_mode to oq.qv.lra, pass the initial DR model path to the --model_name_or_path argument, and set the loss weights in Eq. 8, as described in our paper.
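
Eq. 8 is not reproduced in this README, so the following is only a generic illustration of how a weighted objective with an alignment term can be assembled. It assumes, purely for the sake of the example, that alignment is measured as a KL divergence between the softmax-normalized scores of an original query and its variation over the same candidate passages. The KL form, the function names, and the weight names w_qv and w_lra are assumptions; please consult the paper and train_rodr_dpr.sh for the actual definition.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alignment_loss(oq_scores, qv_scores):
    """Illustrative alignment term (an assumption, not the paper's equation)."""
    return kl_divergence(softmax(oq_scores), softmax(qv_scores))


def total_loss(loss_oq, loss_qv, loss_lra, w_qv=1.0, w_lra=1.0):
    """Weighted sum of the three terms; the weights are hypothetical names."""
    return loss_oq + w_qv * loss_qv + w_lra * loss_lra
```

Under this assumed form, the alignment term is zero when the variation scores the candidates exactly like the original query, and grows as the two score distributions diverge.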

Retrieval

After training a DR model, you can use it to carry out dense retrieval as follows:

- Tokenizing: use the tokenize_passages.py and tokenize_queries.py scripts to tokenize all passages in the corpus, the original queries, and the query variations.
- Encoding and Retrieval: refer to encode_retrieve_dpr.sh to first encode passages and queries into vectors, and then use Faiss to index and retrieve.
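
Conceptually, the retrieval step scores every passage vector against the query vector and keeps the top results. The sketch below is a brute-force, pure-Python stand-in for what a flat index computes, assuming inner-product similarity as is common for DPR-style models. It does not use Faiss and is not the code in encode_retrieve_dpr.sh; the function names are hypothetical.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec, passage_vecs, k=10):
    """Return the top-k (pid, score) pairs, highest score first.

    passage_vecs: {pid: vector}. Every passage is scored exhaustively,
    which is only practical for small collections.
    """
    scored = ((inner_product(query_vec, vec), pid) for pid, vec in passage_vecs.items())
    top = heapq.nlargest(k, scored)
    return [(pid, score) for score, pid in top]
```

For a corpus the size of MS MARCO, an exhaustive Python loop like this would be far too slow, which is why the actual pipeline relies on Faiss.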

For zero-shot retrieval on ANTIQUE, where all DR models are trained only on the MS MARCO passage dataset, please refer to the run_antique_zeroshot.sh script.

For evaluation on the MS MARCO passage ranking dataset (MRR@10, Recall, and statistical t-tests), we provide the variations_avg_tt_test.py script, which computes the metrics for all paired run files from the two DR models being compared. You can use it like this:

# for single run file
python variations_avg_tt_test.py qrels run_file1 run_file2
# for all run files
python variations_avg_tt_test.py qrels run_dir1 run_dir2 fusion
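
For readers unfamiliar with the metric, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears). The following is a minimal sketch of that definition, not the code in variations_avg_tt_test.py; the function name and the in-memory dictionary inputs are assumptions.

```python
def mrr_at_10(qrels, run):
    """Mean reciprocal rank at cutoff 10.

    qrels: {qid: set of relevant pids}
    run:   {qid: list of pids ranked by descending score}
    Queries missing from the run contribute a reciprocal rank of 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, pid in enumerate(run.get(qid, [])[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0
```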

Resources

Query variations:
- Passage-Dev: available in the data/msmarco_passage/query folder, for both dev and train query sets.
- Document-Dev: available in the data/msmarco_doc/query folder, for both dev and train query sets.
- ANTIQUE: available in the data/antique/query folder, which are collected from five types of manually validated query variations.

Models:

| MS MARCO Passage | MS MARCO Document |
|---|---|
| DR_OQ | DR_OQ |
| DR_QV | DR_QV |
| DR_OQ->QV | DR_OQ->QV |
| RoDR w/ DR_OQ | RoDR w/ DR_OQ |

Retrieval files^*:

| Dataset | DR_OQ | RoDR w/ DR_OQ |
|---|---|---|
| Passage-Dev | Download | Download |
| Document-Dev | Download | Download |
| ANTIQUE | Download | Download |

^* Due to the large size of run files on Passage-Dev, we only provide the run files of DR_OQ and RoDR w/ DR_OQ models. If you want to obtain the run files of DR_QV and DR_OQ->QV models, please feel free to contact us.

RoDR on existing DR models

If you want to apply RoDR to publicly available DR models, such as ANCE, TAS-Balanced, and ADORE+STAR, which are enhanced in our paper, you need to make some minor changes at the model level, such as adding the pooler in ANCE and using separate query and passage encoders in ADORE+STAR. Here, we provide the model checkpoints and retrieval files for the reproducibility of our experiments and for other research uses.

Models:

| Original | RoDR |
|---|---|
| ANCE | RoDR w/ ANCE |
| TAS-Balanced | RoDR w/ TAS-Balanced |
| ADORE+STAR | RoDR w/ ADORE+STAR |

Retrieval files^**:

| Model | Passage-Dev | ANTIQUE |
|---|---|---|
| RoDR w/ ANCE | Download | Download |
| RoDR w/ TAS-Balanced | Download | Download |
| RoDR w/ ADORE+STAR | Download | Download |

^** Due to the large size of run files on Passage-Dev, we only provide the run files of the RoDR models. If you want to obtain the run files of the original DR models, please feel free to contact us.

Citation

If you find our paper/resources useful, please cite:

@inproceedings{chen_ijcai2022-275,
  title     = {Towards Robust Dense Retrieval via Local Ranking Alignment},
  author    = {Xuanang Chen and
               Jian Luo and
               Ben He and
               Le Sun and
               Yingfei Sun},
  booktitle = {Proceedings of the Thirty-First International Joint Conference on
               Artificial Intelligence, {IJCAI-22}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  pages     = {1980--1986},
  year      = {2022}
}
