A curated list of awesome papers on pre-trained models for information retrieval. Any feedback and contributions are welcome!
- Survey paper
- Phase 1: First-stage retrieval
- Phase 2: Re-ranking stage
- Multimodal Retrieval
- Other Resources
We also include recent multimodal pre-training works whose pre-trained models are fine-tuned on cross-modal retrieval tasks in their experiments.
For readers who want to acquire basic and advanced knowledge about neural models for information retrieval, or to try some neural models by hand, we recommend the awesome NeuIR survey below and the text-matching toolkit MatchZoo-py:
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et al.
Pre-trained models
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et al.
Pre-trained models for information retrieval
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et al.
- Neural term weighting framework
- Design new pre-training tasks for retrieval
- Decouple the encoding of query and document
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et al. WWW 2020. [code] (HDCT)
- Document Expansion by Query Prediction. Rodrigo Nogueira et al. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et al. ACL 2019. [code] (ORQA, ICT)
- Pre-training Tasks for Embedding-based Large Scale Retrieval. Wei-Cheng Chang et al. ICLR 2020. (ICT, BFS and WLP)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et al. ICML 2020. [code] (REALM)
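The ICT (Inverse Cloze Task) objective used by ORQA and by Chang et al. builds pseudo query–document pairs from unlabeled text: one sentence of a passage plays the query, the surrounding context plays the relevant document. A minimal sketch of pair generation; the function name is ours, and the "sometimes keep the sentence in the context" trick (so lexical matching is occasionally possible) follows the common setup but should be checked against each paper:

```python
import random

def ict_pair(passage_sentences, remove_prob=0.9, rng=None):
    """Inverse Cloze Task: sample one sentence as a pseudo-query and use the
    remaining context as its pseudo-relevant document.

    With probability `remove_prob` the sampled sentence is deleted from the
    context, pushing the retriever toward semantic rather than purely lexical
    matching; otherwise it is kept so exact-match evidence still appears.
    """
    rng = rng or random.Random()
    idx = rng.randrange(len(passage_sentences))
    query = passage_sentences[idx]
    if rng.random() < remove_prob:
        context = passage_sentences[:idx] + passage_sentences[idx + 1:]
    else:
        context = list(passage_sentences)
    return query, " ".join(context)
```

Pairs produced this way are then used to train a dual encoder with the context as the positive and in-batch contexts as negatives.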
In traditional ad-hoc retrieval
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et al. SIGIR 2020. [code] (ColBERT)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et al. SIGIR 2020. [code] (PreTTR)
- Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. Samuel Humeau, Kurt Shuster et al. ICLR 2020. [code] (Poly-encoders)
- Modularized Transformer-based Ranking Framework. Luyu Gao et al. EMNLP 2020. [code] (MORES, similar to Poly-encoders)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et al. ICLR 2021. [code] (ANCE)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et al. [code] (RepBERT)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et al. NAACL 2021. (RocketQA)
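The first-stage models above differ mainly in how query and document representations interact at scoring time: single-vector bi-encoders (RepBERT, ANCE, RocketQA) compare one embedding per text with an inner product, while ColBERT keeps per-token embeddings and scores with late interaction (MaxSim). A toy numpy sketch of both scoring functions, with random vectors standing in for encoder outputs and all names being illustrative:

```python
import numpy as np

def dot_score(q_vec, d_vec):
    """Single-vector bi-encoder score: one embedding per query and per
    document, compared with a single inner product."""
    return float(np.dot(q_vec, d_vec))

def maxsim_score(q_tokens, d_tokens):
    """ColBERT-style late interaction: each query token embedding is matched
    to its most similar document token embedding, and the per-query-token
    maxima are summed."""
    sim = q_tokens @ d_tokens.T          # (|q|, |d|) similarity matrix
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

# Toy sizes: 4 query tokens, 20 document tokens, dimension 8.
rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(4, 8))
d_tokens = rng.normal(size=(20, 8))
score = maxsim_score(q_tokens, d_tokens)
```

The trade-off the papers explore is exactly this one: single-vector scoring indexes one vector per document and is cheapest; late interaction stores one vector per token but recovers much of the effectiveness of full cross-attention.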
In open domain question answering
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et al. ACL 2019. [code] (DenSPI)
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et al. EMNLP 2020. [code] (DPR)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et al. ACL 2020. [code] (SPARC, sparse vectors)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et al. SIGIR 2020 short. (DC-BERT)
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard et al. ICLR 2021.
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et al. arXiv 2021. [code] (DensePhrases)
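At inference time, DPR-style retrievers reduce open-domain retrieval to maximum inner product search over precomputed passage embeddings (in practice approximated with a library such as FAISS). A minimal exhaustive-search sketch of that step, with illustrative names and toy vectors in place of real encoder outputs:

```python
import numpy as np

def retrieve_topk(query_emb, corpus_embs, k=3):
    """Exhaustive inner-product search over a matrix of passage embeddings:
    the exact counterpart of the approximate nearest-neighbor search used by
    DPR-style retrievers at scale."""
    scores = corpus_embs @ query_emb      # (N,) inner products, one per passage
    topk = np.argsort(-scores)[:k]        # indices of the best-scoring passages
    return topk, scores[topk]

# Toy corpus of three 2-d passage embeddings.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
ids, scores = retrieve_topk(np.array([1.0, 0.0]), corpus, k=2)
```

Because passage embeddings are computed offline once, only the query is encoded at question time, which is what makes the dense first stage fast enough for open-domain QA.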
- Directly apply pre-trained models to IR
- Design new pre-training tasks for reranking
- Modify on top of the existing pre-trained models
- Passage Re-ranking with BERT. Rodrigo Nogueira et al. [code] (monoBERT: perhaps the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT. Rodrigo Nogueira et al. (duoBERT: pointwise + pairwise)
- Simple Applications of BERT for Ad Hoc Document Retrieval / Applying BERT to Document Retrieval with Birch. Wei Yang, Haotian Zhang et al. / Zeynep Akkalyoncu Yilmaz et al. EMNLP 2019 short. [code] (Birch: sentence-level)
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-firstP, BERT-sumP: passage-level)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et al. SIGIR 2019 short. [code] (CEDR: BERT + ranking model)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et al. SIGIR 2020. [code] (curriculum learning based on BM25)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et al. WWW 2020. (PCGM)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et al. WWW 2020. [code] (ReInfoSelect)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et al. EMNLP 2020. [code] (using T5)
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et al. EMNLP 2020 short. (query likelihood computed by GPT)
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et al. EMNLP 2020 Findings. [code] (BERT-QE)
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et al. NeurIPS 2020. [code] (CRISS)
- A Linguistic Study on Relevance Modeling in Information Retrieval. Yixing Fan, Jiafeng Guo et al. WWW 2021. (Prob & Intervention)
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et al. WWW 2021. (GDMTL, joint discriminative and generative model with multi-task learning)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et al. WSDM 2021. [code] (PROP)
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et al. SIGIR 2020 short. [code] (TKL: Transformer-Kernel for long text)
- The Cascade Transformer: An Application for Efficient Answer Sentence Selection. Luca Soldaini et al. ACL 2020. [code] (Cascade Transformer)
- Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks. Tingyu Xia et al. WWW 2021. [code] (text matching: guiding BERT's attention)
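The re-ranking works above all sit in the second stage of a retrieve-then-rerank pipeline: a cheap first stage (e.g. BM25) produces candidates, and an expensive model scores each query–document pair jointly. A minimal sketch of that stage; `score_fn` stands in for a cross-encoder such as monoBERT, which feeds the concatenated query and document through BERT and reads a relevance score off the [CLS] representation (the function and the toy scorer below are illustrative):

```python
def rerank(query, candidates, score_fn, k=10):
    """Second-stage re-ranking: score every (query, doc) candidate pair with
    a joint model and return the top-k documents by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def toy_overlap_score(query, doc):
    """Stand-in scorer: word overlap between query and document. A real
    pipeline would call a fine-tuned cross-encoder here."""
    return len(set(query.split()) & set(doc.split()))
```

Because the joint model sees both texts at once, it is far more accurate than first-stage scoring but can only be afforded on the small candidate list, which is exactly the trade-off works like the Cascade Transformer and PreTTR try to soften.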
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et al. arXiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et al. CVPR 2021. [code] (VinVL)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et al. CVPR 2020. [code] (a multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et al. ICML 2021. [code] (CLIP, OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et al. arXiv 2020. [code] (ERNIE-ViL, 1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et al. CVPR 2021. [code] (M3P, MILD dataset)
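For the dual-encoder family of multimodal models (CLIP being the clearest example), cross-modal retrieval at inference is just cosine similarity between L2-normalized image and text embeddings in a shared space. A toy numpy sketch of the retrieval step, with illustrative names and tiny stand-in embeddings:

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit sphere, row-wise."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_to_image_retrieval(text_embs, image_embs, k=1):
    """CLIP-style retrieval: after normalization, the inner product between a
    text row and an image row is their cosine similarity; return the indices
    of the top-k images for each text."""
    sim = l2_normalize(text_embs) @ l2_normalize(image_embs).T  # (T, I)
    return np.argsort(-sim, axis=1)[:, :k]
```

Image-to-text retrieval is the same computation with the similarity matrix transposed, which is why a single pre-trained dual encoder covers both directions of the cross-modal retrieval benchmarks these papers evaluate on.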
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et al. arXiv 2020.