HugoZHL / Neural-Corpus-Indexer-NCI

A Neural Corpus Indexer for Document Retrieval (NCI)

What is NCI?

NCI is an end-to-end, differentiable sequence-to-sequence document retrieval model that directly generates the identifiers of relevant documents for a given query. In our evaluation on the Google NQ and TriviaQA datasets, NCI outperforms all baselines and model-based indexers:

| Model | Recall@1 | Recall@10 | Recall@100 | MRR@100 |
|---|---|---|---|---|
| NCI (Ensemble) | 70.46 | 89.35 | 94.75 | 77.82 |
| NCI (Large) | 66.23 | 85.27 | 92.49 | 73.37 |
| NCI (Base) | 65.86 | 85.20 | 92.42 | 73.12 |
| DSI (T5-Base) | 27.40 | 56.60 | -- | -- |
| DSI (T5-Large) | 35.60 | 62.60 | -- | -- |
| SEAL (Large) | 59.93 | 81.24 | 90.93 | 67.70 |
| ANCE (MaxP) | 52.63 | 80.38 | 91.31 | 62.84 |
| BM25 + DocT5Query | 35.43 | 61.83 | 76.92 | 44.47 |

For more information, check out our paper: https://arxiv.org/abs/2206.02743
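For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@k and MRR@k are commonly computed from ranked lists of document identifiers. It is not taken from the NCI code base: the function names, the dictionary layout, and the toy data are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant id within the top-k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    predictions: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    ks: Sequence[int] = (1, 10, 100),
    mrr_k: int = 100,
) -> Dict[str, float]:
    """Average each metric over all queries (hypothetical input layout)."""
    n = len(predictions)
    metrics = {
        f"Recall@{k}": sum(recall_at_k(predictions[q], gold[q], k) for q in predictions) / n
        for k in ks
    }
    metrics[f"MRR@{mrr_k}"] = sum(mrr_at_k(predictions[q], gold[q], mrr_k) for q in predictions) / n
    return metrics


# Toy example: two queries, each with a single relevant document.
toy_predictions = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8", "d7"]}
toy_gold = {"q1": {"d1"}, "q2": {"d9"}}
print(evaluate(toy_predictions, toy_gold))
```

When every query has exactly one relevant document, Recall@k reduces to the share of queries whose relevant document appears in the top k.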

Environment

[1] Install Anaconda.

[2] Clone repository:

git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI

[3] Create conda environment:

conda env create -f environment.yml
conda activate NCI

[4] Docker:

If needed, a Docker image for NCI is available as mzmssg/corpus_env:latest.

Data Process

You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Dataset Download.

Currently, NCI is evaluated on the Google NQ and TriviaQA datasets. Please download them before re-training.

Semantic Identifier

NCI uses content-based document identifiers. A pre-trained BERT model generates an embedding for each document, the documents are clustered with hierarchical K-means, and each document is assigned a semantic identifier derived from the clustering. You can generate several sets of embeddings and semantic identifiers and run an NCI model on each for ensembling.

Please find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.
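The idea of deriving identifiers from hierarchical K-means can be sketched as follows. This is an illustrative, dependency-free approximation and not the notebook code: the function names, the branching factor `k`, the leaf size `max_leaf`, and the random vectors standing in for BERT embeddings are all assumptions. Each document receives a tuple made of the cluster index chosen at every level, followed by its position inside the final small cluster.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, n_iter: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: _sq_dist(p, centers[j])) for p in points]
        # Update step: move each non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def assign_semantic_ids(
    embeddings: List[Vector], k: int = 10, max_leaf: int = 10
) -> List[Tuple[int, ...]]:
    """Recursively cluster documents and return one identifier tuple per document."""
    ids: List[Tuple[int, ...]] = [()] * len(embeddings)

    def recurse(indices: List[int], prefix: Tuple[int, ...]) -> None:
        labels = None
        if len(indices) > max_leaf:
            labels = kmeans([embeddings[i] for i in indices], min(k, len(indices)))
        if labels is None or len(set(labels)) == 1:
            # Small (or unsplittable) cluster: number the documents inside it.
            for pos, idx in enumerate(indices):
                ids[idx] = prefix + (pos,)
            return
        for j in sorted(set(labels)):
            recurse([i for i, lab in zip(indices, labels) if lab == j], prefix + (j,))

    recurse(list(range(len(embeddings))), ())
    return ids


# Toy usage: random vectors stand in for BERT document embeddings.
rng = random.Random(42)
toy_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
toy_ids = assign_semantic_ids(toy_embeddings, k=3, max_leaf=5)
print(toy_ids[:5])
```

Documents that land in the same clusters share an identifier prefix, which is what makes such identifiers content-based. Changing the embedding model or the random seed yields a different set of identifiers, which is one way to obtain several variants for ensembling.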

Query Generation

In our study, query generation significantly improves retrieval performance, especially for long-tail queries.

NCI uses a docTTTTTquery checkpoint to generate synthetic queries. Please refer to the docTTTTTquery documentation; more details are in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Training

Once data pre-processing is complete, launch training with train.sh. You can also train on our NQ data (download it to './Data_process/NQ_dataset/') or our TriviaQA data (download it to './Data_process/trivia_dataset/').

Evaluation

Run infer.sh with our NQ checkpoint or TriviaQA checkpoint (download it to './NCI_model/logs/'). You can also run inference with your own checkpoint to evaluate model performance.

To ensemble on the NQ or TriviaQA dataset, use our results (download them to './NCI_model/logs/') or your own results.
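This README does not describe how the result files are combined. As a generic illustration only, the sketch below merges ranked lists from several models with reciprocal rank fusion (RRF), a standard rank-aggregation technique; it may differ from what the repository does. The function name, the constant `k = 60`, and the toy rankings are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(runs: List[List[str]], k: int = 60, top_n: int = 100) -> List[str]:
    """Merge several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) in every run where it appears;
    documents are then sorted by total score (ties broken by id).
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in runs:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Toy usage: rankings for one query from three hypothetical models.
run_a = ["a", "b", "c"]
run_b = ["b", "a", "d"]
run_c = ["b", "c", "a"]
print(reciprocal_rank_fusion([run_a, run_b, run_c]))
```

RRF only needs the rank positions, so it works even when the models' scores are not on a comparable scale.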

Acknowledgement

We learned a lot and borrowed some code from the following projects when building NCI.
