Hannibal046 / nanoDPR

Simple replication of DPR (Dense Passage Retrieval)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

nanoDPR

Dense Passage Retrieval (DPR) is a widely-recognized technique and the foundation of retrieval-augmented LLM. The authors of the paper have provided an excellent open-source repository. However, the original repository is primarily driven by academic research and includes numerous configurable options. In contrast, this repository aims to offer a simplified replication of the DPR model on the Natural Questions dataset, allowing for a clear and straightforward understanding of DPR without compromising any details. With approximately 300 lines of code, we can train a DPR from scratch and achieve results comparable to those presented in the original paper.

In short, this repo enables:

  • training a dense retriever from scratch on Natural Question dataset
  • loading the original checkpoint provided by the official repo
  • evaluating dense retriever

Requirements

# install pytorch according to the cuda version (https://pytorch.org/get-started/previous-versions/)
# install faiss (https://github.com/facebookresearch/faiss/blob/main/INSTALL.md)
pip install transformers==4.30.2 accelerate==0.20.3 wandb wget spacy

Data

python utils/download_data.py --resource data.wikipedia_split.psgs_w100
python utils/download_data.py --resource data.retriever.nq
python utils/download_data.py --resource data.retriever.qas.nq

Training from scratch

First configure distributed setting and wandb setting:

accelerate config
wandb login

Then launch training with:

accelerate launch train_dpr.py

After training, we would get a trained query encoder and a doc encoder.

Evaluation

To evaluate the performance of retriever on the Natural Question dataset, firstly use doc encoder to encode all wikipedia passages:

## for nanoDPR
accelerate launch doc2embedding.py \
    --pretrained_model_path your/own/nanoDPR/model \
    --output_dir embedding/nanoDPR

## for official DPR
accelerate launch doc2embedding.py \
    --pretrained_model_path facebook/dpr-ctx_encoder-single-nq-base \
    --output_dir embedding/DPR

Then test DPR with:

## for nanoDPR
python test_dpr.py --embedding_dir embedding/nanoDPR --pretrained_model_path your/own/nanoDPR/model

## for official DPR
python test_dpr.py --embedding_dir embedding/DPR --pretrained_model_path facebook/dpr-question_encoder-single-nq-base

Here we provide our trained query encoder and doc encoder here.

Results

Here we show our replicated results of DPR on the NQ dataset:

Top-20 Top-100
Reported 78.4 85.4
Ours 79.1 85.9

We also report the training and evaluation cost (all experiments are conducted on 8*V100 32G):

training generate embedding build index search index
Duration 4h 29m 4h 42m 20m 58s

About

Simple replication of DPR (Dense Passage Retrieval)


Languages

Language:Python 100.0%