Code for the ECIR 2024 paper "Shallow Cross-Encoders for Low-Latency Retrieval" by Aleksandr V. Petrov, Sean MacAvaney, and Craig Macdonald.
If you use this code, please consider citing the paper:
@inproceedings{petrov2023shallow,
  title={Shallow Cross-Encoders for Low-Latency Retrieval},
  author={Petrov, Aleksandr and Macdonald, Craig and MacAvaney, Sean},
  booktitle={European Conference on Information Retrieval},
  year={2024}
}
To run the code, please install the following dependencies: PyTorch, Hugging Face Transformers, PyTerrier, pyterrier-pisa, ir-datasets, ir-measures, and TensorBoard. You can install all of them using pip:
pip3 install -r requirements.txt
Note that PyTerrier also depends on a Java installation and requires the JAVA_HOME environment variable to point to it.
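If you drive everything from Python, you can also set the variable programmatically before PyTerrier initialises. A minimal sketch, assuming a typical Linux JDK location (adjust the path for your system):

```python
import os

# Example JDK path only: point this at your own Java installation.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")

import pyterrier as pt

if not pt.started():
    pt.init()  # boots the Java bridge; fails if JAVA_HOME is wrong
```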
Before training shallow cross-encoders, run the bm25ids2tensor.py script. The script extracts 1000 candidate documents for each query in the MS MARCO training set and pre-tokenizes them, so that no time is spent on first-stage retrieval during training.
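For orientation, the idea behind this preprocessing looks roughly like the sketch below. It is illustrative rather than the script's exact code: the tokenizer choice, sequence length, and output file name are all assumptions.

```python
import torch
import pyterrier as pt
from pyterrier_pisa import PisaIndex
from transformers import AutoTokenizer

pt.init()

# BM25 over a pre-built PISA index of MS MARCO passages, 1000 candidates per query,
# with passage texts attached so the query/passage pairs can be tokenized.
bm25 = PisaIndex.from_dataset('msmarco_passage').bm25(num_results=1000)
pipeline = bm25 >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')

queries = pt.get_dataset('irds:msmarco-passage/train').get_topics()
res = pipeline(queries.head(2))  # tiny sample; the real script covers the full train set

# Pre-tokenize the query/passage pairs into fixed-size tensors.
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
enc = tokenizer(list(res['query']), list(res['text']),
                truncation=True, max_length=256,
                padding='max_length', return_tensors='pt')

torch.save({'qid': res['qid'].tolist(), 'docno': res['docno'].tolist(),
            'input_ids': enc['input_ids'], 'attention_mask': enc['attention_mask']},
           'bm25_train_tensors.pt')  # assumed output name
```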
Cross-encoders can be trained using the following command:
python3 train_shallow_crossencoder.py
Some useful parameters:

| Parameter | Description |
|---|---|
| `--backbone-model` | Backbone model: `prajjwal1/bert-tiny`, `prajjwal1/bert-mini`, or `prajjwal1/bert-small` |
| `-t` | Parameter t for the gBCE loss; we recommend setting it to 0.75 (see the sketch after this table) |
| `--negs` | Number of negatives per positive; defaults to 16 |
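For example, a typical invocation combining these parameters:

python3 train_shallow_crossencoder.py --backbone-model prajjwal1/bert-tiny -t 0.75 --negs 16

Conceptually, gBCE is binary cross-entropy in which the predicted positive probability is raised to a power beta derived from t and the negative sampling rate. A minimal PyTorch sketch, assuming the beta calibration from the authors' gSASRec paper and that the negatives are sampled from roughly 1000 BM25 candidates (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def gbce_loss(pos_logits: torch.Tensor, neg_logits: torch.Tensor, t: float = 0.75) -> torch.Tensor:
    """Illustrative gBCE: pos_logits is (batch,), neg_logits is (batch, negs)."""
    # Fraction of the candidate negatives actually sampled (assumption: ~16 of 999).
    alpha = neg_logits.size(-1) / 999.0
    # Calibration exponent from t and alpha (gSASRec-style; t=0 recovers plain BCE).
    beta = alpha * (t * (1 - 1 / alpha) + 1 / alpha)
    # -log(sigmoid(s+)^beta) == beta * softplus(-s+)
    pos_loss = beta * F.softplus(-pos_logits)
    # -log(1 - sigmoid(s-)) == softplus(s-), summed over the sampled negatives
    neg_loss = F.softplus(neg_logits).sum(dim=-1)
    return (pos_loss + neg_loss).mean()
```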
The training script will spawn TensorBoard on port 26006; during training, you can monitor model metrics in the TensorBoard interface (e.g. at http://localhost:26006).
To compare with the baselines, run python3 evaluate_tinybert.py. Before evaluating, make sure to update the model checkpoints specified in the evaluation code so that they point to your trained models.
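For orientation, this kind of evaluation can be set up in PyTerrier roughly as follows. This is a hedged sketch, not the contents of evaluate_tinybert.py: the checkpoint path, dataset, batch size, and metrics are assumptions.

```python
import torch
import pyterrier as pt
from pyterrier_pisa import PisaIndex
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pt.init()

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
model = AutoModelForSequenceClassification.from_pretrained('prajjwal1/bert-tiny', num_labels=1)
# model.load_state_dict(torch.load('path/to/your/checkpoint.pt'))  # plug in your trained weights
model.eval()

def score_batch(df):
    # Score (query, passage) pairs with the cross-encoder.
    with torch.no_grad():
        enc = tokenizer(list(df['query']), list(df['text']),
                        truncation=True, padding=True, return_tensors='pt')
        return model(**enc).logits.squeeze(-1).tolist()

bm25 = PisaIndex.from_dataset('msmarco_passage').bm25(num_results=1000)
reranker = (bm25
            >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
            >> pt.apply.doc_score(score_batch, batch_size=64))

dl19 = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
print(pt.Experiment([bm25, reranker],
                    dl19.get_topics(), dl19.get_qrels(),
                    eval_metrics=['nDCG@10', 'RR@10'],
                    names=['BM25', 'BM25 >> shallow cross-encoder']))
```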