jordane95 / dual-cross-encoder

Dual Cross Encoder for Dense Retrieval

Dual Cross Encoder

Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval

Environment Setup

Make sure you have a Python >= 3.7 environment with PyTorch installed, then run the following command to set up the package.

pip install -e .
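Before installing, you can sanity-check the environment with a quick snippet like the following (illustrative only, not part of the repo):

```python
import sys

# The repo expects Python >= 3.7 with PyTorch already installed.
assert sys.version_info >= (3, 7), "Python >= 3.7 is required"

try:
    import torch  # PyTorch must be installed before `pip install -e .`
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
except ImportError:
    print("PyTorch is not installed")
```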

Experiments

Training

There are two ways to train the model: one uses query generation as data augmentation and the other does not.

Note: Our models are trained on 8 V100 GPUs with 32G memory. If you use a different configuration, please change the parameters in the training scripts accordingly.

  • w/o data augmentation
MODEL_DIR=/path/to/save/model
bash scripts/train.sh $MODEL_DIR
  • w/ data augmentation
PRETRAINED_MODEL_DIR=/path/to/save/pretrained/model
MODEL_DIR=/path/to/save/model
bash scripts/pretrain_corpus.sh $PRETRAINED_MODEL_DIR
bash scripts/finetune.sh $MODEL_DIR $PRETRAINED_MODEL_DIR
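The note above asks you to adjust parameters when your hardware differs from 8x V100 (32G). One common adjustment is keeping the effective (global) batch size constant by trading GPU count against gradient accumulation; the arithmetic can be sketched as follows (the variable names are illustrative, not the actual flags in the training scripts):

```python
def grad_accum_steps(global_batch: int, n_gpus: int, per_device_batch: int) -> int:
    """Gradient accumulation steps needed to reach `global_batch`.

    Illustrative only: the real flag names in scripts/train.sh may differ.
    """
    per_step = n_gpus * per_device_batch
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by n_gpus * per_device_batch")
    return global_batch // per_step

# 8 GPUs x 8 examples per device reach a global batch of 64 in one step;
# on 2 GPUs the same global batch needs 4 accumulation steps.
print(grad_accum_steps(64, 8, 8))  # 1
print(grad_accum_steps(64, 2, 8))  # 4
```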

Encoding

The following commands encode the corpus into vectors. The corpus is partitioned into 20 shards due to resource limits.

ENCODE_DIR=/path/to/save/encoding
# encode corpus
for i in $(seq 0 19)
do
bash scripts/encode_corpus_with_query_shard.sh $ENCODE_DIR $i $MODEL_DIR
done
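Each loop iteration encodes one of the 20 shards. The exact slicing is internal to scripts/encode_corpus_with_query_shard.sh, but a contiguous split of this kind can be sketched as below (the helper name and the use of ceiling division are assumptions):

```python
def shard_bounds(corpus_size: int, n_shards: int, shard_id: int) -> tuple:
    """Return the [start, end) slice of the corpus handled by one shard."""
    per_shard = -(-corpus_size // n_shards)  # ceiling division
    start = shard_id * per_shard
    end = min(start + per_shard, corpus_size)
    return start, end

# e.g. the MS MARCO passage corpus has 8,841,823 passages
for i in (0, 19):
    print(i, shard_bounds(8_841_823, 20, i))
```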

Retrieval

We evaluate the retrieval performance on the following two benchmarks.

  • MS MARCO
RESULT_DIR=/path/to/save/result

# encode query
bash scripts/encode_dev_query.sh $ENCODE_DIR $MODEL_DIR

# shard search
for i in $(seq 0 19)
do
bash scripts/search_shard.sh $ENCODE_DIR $i
done

# reduce
bash scripts/reduce.sh $ENCODE_DIR $RESULT_DIR

# evaluation
bash scripts/evaluate.sh $RESULT_DIR
  • TREC DL
YEAR=2019 # or 2020
RESULT_DIR=/path/to/save/result

# encode query
bash scripts/encode_trec_query.sh $ENCODE_DIR $MODEL_DIR $YEAR

# shard search
for i in $(seq 0 19)
do
bash scripts/search_trec_shard.sh $ENCODE_DIR $i $YEAR
done

# reduce
bash scripts/reduce_trec.sh $ENCODE_DIR $RESULT_DIR $YEAR

# evaluation
bash scripts/evaluate_trec.sh $RESULT_DIR $YEAR
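In both pipelines above, the per-shard searches each return a ranked list, and the reduce step merges them into a single ranking per query. A minimal sketch of that merge (the `(doc_id, score)` tuple format is an assumption; the actual reduce scripts may store results differently):

```python
import heapq

def merge_shard_results(shard_hits, k=1000):
    """Merge per-shard (doc_id, score) lists into one global top-k.

    shard_hits: iterable of per-shard hit lists for a single query.
    Scores from a shared encoder are comparable across shards, so the
    global top-k is simply the k highest-scoring hits overall.
    """
    all_hits = [hit for shard in shard_hits for hit in shard]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])

shards = [
    [("d1", 0.9), ("d2", 0.5)],
    [("d3", 0.8), ("d4", 0.7)],
]
print(merge_shard_results(shards, k=3))  # [('d1', 0.9), ('d3', 0.8), ('d4', 0.7)]
```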

Acknowledgement

The code is mainly based on the Tevatron toolkit. We also use some code and data from docTTTTTquery, BEIR, and transformers. Thanks for the great work!

License: Apache License 2.0