isuco / Neural-Corpus-Indexer-NCI

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

[👑 NeurIPS 2022 Outstanding Paper] A Neural Corpus Indexer for Document Retrieval -- NCI (Paper)

made-with-python

What is NCI?

NCI is an end-to-end, sequence-to-sequence differentiable document retrieval model which retrieve relevant document identifiers directly for specific queries. In our evaluation on Google NQ dataset and TriviaQA dataset, NCI outperforms all baselines and model-based indexers:

Model Recall@1 Recall@10 Recall@100 MRR@100
NCI (Ensemble) 70.46 89.35 94.75 77.82
NCI (Large) 66.23 85.27 92.49 73.37
NCI (Base) 65.86 85.20 92.42 73.12
DSI (T5-Base) 27.40 56.60 -- --
DSI (T5-Large) 35.60 62.60 -- --
SEAL (Large) 59.93 81.24 90.93 67.70
ANCE (MaxP) 52.63 80.38 91.31 62.84
BM25 + DocT5Query 35.43 61.83 76.92 44.47

For more information, checkout our publications: https://arxiv.org/abs/2206.02743

Environemnt

[1] Install Anaconda.

[2] Clone repository:

git clone https://github.com/solidsea98/Neural-Corpus-Indexer-NCI.git
cd Neural-Corpus-Indexer-NCI

[3] Create conda environment:

conda env create -f environment.yml
conda activate NCI

[4] Docker:

If necessary, the NCI docker is mzmssg/corpus_env:latest.

Data Process

You can process data with NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

[1] Dataset Download.

Currently NCI is evaluated on Google NQ dataset and TriviaQA dataset. Please download it before re-training.

[2] Semantic Identifier

NCI uses content-based document identifiers: A pre-trained BERT is used to generate document embeddings, and then documents are clustered using hierarchical K-means and semantic identifiers are assigned to each document. You can generate several embeddings and semantic identifiers to run NCI model for ensembling.

[3] Query Generation

In our study, Query Generation can significantly improve retrieve performance, especially for long-tail queries.

NCI uses docTTTTTquery checkpoint to generate synthetic queries. Please refer to docTTTTTquery documentation.

Find more details in NQ_dataset_Process.ipynb and Trivia_dataset_Process.ipynb.

Training

Once the data pre-processing is complete, you can launch training by train.sh. You can also launch training along with our NQ data (Download it to './Data_process/NQ_dataset/') and TriviaQA data (Download it to './Data_process/trivia_dataset/').

Evaluation

Please use infer.sh along with our NQ checkpoint or TriviaQA checkpoint (Download it to './NCI_model/logs/'). You can also inference with your own checkpoint to evaluate model performance.

Please ensemble NQ dataset or TriviaQA dataset along with our results (Download it to './NCI_model/logs/') or your own results.

Citation

If you find this work useful for your research, please cite:

@article{wang2022neural,
  title={A Neural Corpus Indexer for Document Retrieval},
  author={Wang, Yujing and Hou, Yingyan and Wang, Haonan and Miao, Ziming and Wu, Shibin and Sun, Hao and Chen, Qi and Xia, Yuqing and Chi, Chengmin and Zhao, Guoshuai and others},
  journal={arXiv preprint arXiv:2206.02743},
  year={2022}
}

Acknowledgement

We learned a lot and borrowed some code from the following projects when building NCI.

About


Languages

Language:Python 98.9%Language:Jupyter Notebook 1.0%Language:Shell 0.1%