Code for the ECIR 2024 paper "Shallow Cross-Encoders for Low-Latency Retrieval" by Aleksandr V. Petrov, Sean MacAvaney, and Craig Macdonald.
If you use this code, please consider citing the paper:
@inproceedings{petrov2023shallow,
  title={Shallow Cross-Encoders for Low-Latency Retrieval},
  author={Petrov, Aleksandr and Macdonald, Craig and MacAvaney, Sean},
  booktitle={European Conference on Information Retrieval},
  year={2024}
}
To run the code, please install the following dependencies: PyTorch, Hugging Face Transformers, PyTerrier, pyterrier-pisa, ir-datasets, ir-measures, and TensorBoard. You can install all of them using pip:
pip3 install -r requirements.txt
Note that PyTerrier also depends on a Java installation and requires the JAVA_HOME environment variable to point to it.
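If you drive everything from Python, you can also set the variable programmatically before PyTerrier initialises. A minimal sketch, assuming a typical Linux JDK location (adjust the path for your system):

```python
import os

# Example JDK path only: point this at your own Java installation.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk-amd64")

import pyterrier as pt

if not pt.started():
    pt.init()  # boots the Java bridge; fails if JAVA_HOME is wrong
```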
Before training shallow cross-encoders, run the bm25ids2tensor.py script. The script extracts 1000 candidate documents for each query in the MS MARCO training set and pre-tokenizes them, so that no time is spent on first-stage retrieval during training.
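For orientation, the idea behind this preprocessing looks roughly like the sketch below. It is illustrative rather than the script's exact code: the tokenizer choice, sequence length, and output file name are all assumptions.

```python
import torch
import pyterrier as pt
from pyterrier_pisa import PisaIndex
from transformers import AutoTokenizer

pt.init()

# BM25 over a pre-built PISA index of MS MARCO passages, 1000 candidates per query,
# with passage texts attached so the query/passage pairs can be tokenized.
bm25 = PisaIndex.from_dataset('msmarco_passage').bm25(num_results=1000)
pipeline = bm25 >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')

queries = pt.get_dataset('irds:msmarco-passage/train').get_topics()
res = pipeline(queries.head(2))  # tiny sample; the real script covers the full train set

# Pre-tokenize the query/passage pairs into fixed-size tensors.
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
enc = tokenizer(list(res['query']), list(res['text']),
                truncation=True, max_length=256,
                padding='max_length', return_tensors='pt')

torch.save({'qid': res['qid'].tolist(), 'docno': res['docno'].tolist(),
            'input_ids': enc['input_ids'], 'attention_mask': enc['attention_mask']},
           'bm25_train_tensors.pt')  # assumed output name
```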
Cross-encoders can be trained using the following command:
python3 train_shallow_crossencoder.py
Some useful parameters:

| Parameter | Description |
|---|---|
| `--backbone-model` | Backbone model: `prajjwal1/bert-tiny`, `prajjwal1/bert-mini`, or `prajjwal1/bert-small` |
| `-t` | Parameter t for the gBCE loss; we recommend setting it to 0.75 (see the sketch after this table) |
| `--negs` | Number of negatives per positive; defaults to 16 |
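For example, a typical invocation combining these parameters:

python3 train_shallow_crossencoder.py --backbone-model prajjwal1/bert-tiny -t 0.75 --negs 16

Conceptually, gBCE is binary cross-entropy in which the predicted positive probability is raised to a power beta derived from t and the negative sampling rate. A minimal PyTorch sketch, assuming the beta calibration from the authors' gSASRec paper and that the negatives are sampled from roughly 1000 BM25 candidates (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def gbce_loss(pos_logits: torch.Tensor, neg_logits: torch.Tensor, t: float = 0.75) -> torch.Tensor:
    """Illustrative gBCE: pos_logits is (batch,), neg_logits is (batch, negs)."""
    # Fraction of the candidate negatives actually sampled (assumption: ~16 of 999).
    alpha = neg_logits.size(-1) / 999.0
    # Calibration exponent from t and alpha (gSASRec-style; t=0 recovers plain BCE).
    beta = alpha * (t * (1 - 1 / alpha) + 1 / alpha)
    # -log(sigmoid(s+)^beta) == beta * softplus(-s+)
    pos_loss = beta * F.softplus(-pos_logits)
    # -log(1 - sigmoid(s-)) == softplus(s-), summed over the sampled negatives
    neg_loss = F.softplus(neg_logits).sum(dim=-1)
    return (pos_loss + neg_loss).mean()
```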
The training script will spawn TensorBoard on port 26006; during training, you can monitor model metrics in the TensorBoard interface (e.g. at http://localhost:26006).
To compare with the baselines, run python3 evaluate_tinybert.py. Before evaluating, make sure to update the model checkpoints specified in the evaluation code so that they point to your trained models.
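For orientation, this kind of evaluation can be set up in PyTerrier roughly as follows. This is a hedged sketch, not the contents of evaluate_tinybert.py: the checkpoint path, dataset, batch size, and metrics are assumptions.

```python
import torch
import pyterrier as pt
from pyterrier_pisa import PisaIndex
from transformers import AutoTokenizer, AutoModelForSequenceClassification

pt.init()

tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
model = AutoModelForSequenceClassification.from_pretrained('prajjwal1/bert-tiny', num_labels=1)
# model.load_state_dict(torch.load('path/to/your/checkpoint.pt'))  # plug in your trained weights
model.eval()

def score_batch(df):
    # Score (query, passage) pairs with the cross-encoder.
    with torch.no_grad():
        enc = tokenizer(list(df['query']), list(df['text']),
                        truncation=True, padding=True, return_tensors='pt')
        return model(**enc).logits.squeeze(-1).tolist()

bm25 = PisaIndex.from_dataset('msmarco_passage').bm25(num_results=1000)
reranker = (bm25
            >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage'), 'text')
            >> pt.apply.doc_score(score_batch, batch_size=64))

dl19 = pt.get_dataset('irds:msmarco-passage/trec-dl-2019/judged')
print(pt.Experiment([bm25, reranker],
                    dl19.get_topics(), dl19.get_qrels(),
                    eval_metrics=['nDCG@10', 'RR@10'],
                    names=['BM25', 'BM25 >> shallow cross-encoder']))
```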