A curated list of awesome papers on pre-trained models for information retrieval. Any feedback and contributions are welcome!
- Survey paper
- Phase 1: First-stage retrieval
- Phase 2: Re-ranking stage
- Multimodal Retrieval
- Other Resources
We also include recent multimodal pre-training works whose pre-trained models are fine-tuned on cross-modal retrieval tasks in their experiments.
For readers who want to acquire basic and advanced knowledge about neural models for information retrieval, or to try some neural models by hand, we recommend the awesome NeuIR survey below and the text-matching toolkit MatchZoo-py:
- A Deep Look into Neural Ranking Models for Information Retrieval. Jiafeng Guo et al.
Pre-trained models
- Pre-trained Models for Natural Language Processing: A Survey. Xipeng Qiu et al.
Pre-trained models for information retrieval
- Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin et al.
- Neural term weighting framework
- Design new pre-training tasks for retrieval
- Decouple the encoding of query and document
- Context-Aware Term Weighting For First Stage Passage Retrieval. Zhuyun Dai et al. SIGIR 2020 short. [code] (DeepCT)
- Context-Aware Document Term Weighting for Ad-Hoc Search. Zhuyun Dai et al. WWW 2020. [code] (HDCT)
- Document Expansion by Query Prediction. Rodrigo Nogueira et al. [doc2query code, docTTTTTquery code] (doc2query, docTTTTTquery)
- Latent Retrieval for Weakly Supervised Open Domain Question Answering. Kenton Lee et al. ACL 2019. [code] (ORQA, ICT)
- Pre-training Tasks for Embedding-based Large Scale Retrieval. Wei-Cheng Chang et al. ICLR 2020. (ICT, BFS and WLP)
- REALM: Retrieval-Augmented Language Model Pre-Training. Kelvin Guu, Kenton Lee et al. ICML 2020. [code] (REALM)
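The ICT (Inverse Cloze Task) objective used by ORQA and by Chang et al. builds pseudo query–document pairs from unlabeled text: one sentence of a passage plays the query, the surrounding context plays the relevant document. A minimal sketch of pair generation; the function name is ours, and the "sometimes keep the sentence in the context" trick (so lexical matching is occasionally possible) follows the common setup but should be checked against each paper:

```python
import random

def ict_pair(passage_sentences, remove_prob=0.9, rng=None):
    """Inverse Cloze Task: sample one sentence as a pseudo-query and use the
    remaining context as its pseudo-relevant document.

    With probability `remove_prob` the sampled sentence is deleted from the
    context, pushing the retriever toward semantic rather than purely lexical
    matching; otherwise it is kept so exact-match evidence still appears.
    """
    rng = rng or random.Random()
    idx = rng.randrange(len(passage_sentences))
    query = passage_sentences[idx]
    if rng.random() < remove_prob:
        context = passage_sentences[:idx] + passage_sentences[idx + 1:]
    else:
        context = list(passage_sentences)
    return query, " ".join(context)
```

Pairs produced this way are then used to train a dual encoder with the context as the positive and in-batch contexts as negatives.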
In traditional ad-hoc retrieval
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. Omar Khattab et al. SIGIR 2020. [code] (ColBERT)
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations. Sean MacAvaney et al. SIGIR 2020. [code] (PreTTR)
- Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. Samuel Humeau, Kurt Shuster et al. ICLR 2020. [code] (Poly-encoders)
- Modularized Transformer-based Ranking Framework. Luyu Gao et al. EMNLP 2020. [code] (MORES, similar to Poly-encoders)
- Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong et al. ICLR 2021. [code] (ANCE)
- RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. Jingtao Zhan et al. [code] (RepBERT)
- RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. Yingqi Qu et al. NAACL 2021. (RocketQA)
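The first-stage models above differ mainly in how query and document representations interact at scoring time: single-vector bi-encoders (RepBERT, ANCE, RocketQA) compare one embedding per text with an inner product, while ColBERT keeps per-token embeddings and scores with late interaction (MaxSim). A toy numpy sketch of both scoring functions, with random vectors standing in for encoder outputs and all names being illustrative:

```python
import numpy as np

def dot_score(q_vec, d_vec):
    """Single-vector bi-encoder score: one embedding per query and per
    document, compared with a single inner product."""
    return float(np.dot(q_vec, d_vec))

def maxsim_score(q_tokens, d_tokens):
    """ColBERT-style late interaction: each query token embedding is matched
    to its most similar document token embedding, and the per-query-token
    maxima are summed."""
    sim = q_tokens @ d_tokens.T          # (|q|, |d|) similarity matrix
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

# Toy sizes: 4 query tokens, 20 document tokens, dimension 8.
rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(4, 8))
d_tokens = rng.normal(size=(20, 8))
score = maxsim_score(q_tokens, d_tokens)
```

The trade-off the papers explore is exactly this one: single-vector scoring indexes one vector per document and is cheapest; late interaction stores one vector per token but recovers much of the effectiveness of full cross-attention.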
In open domain question answering
- Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. Minjoon Seo, Jinhyuk Lee et al. ACL 2019. [code] (DenSPI)
- Dense Passage Retrieval for Open-Domain Question Answering. Vladimir Karpukhin, Barlas Oguz et al. EMNLP 2020. [code] (DPR)
- Contextualized Sparse Representations for Real-Time Open-Domain Question Answering. Jinhyuk Lee, Minjoon Seo et al. ACL 2020. [code] (SPARC, sparse vectors)
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding. Yuyu Zhang, Ping Nie et al. SIGIR 2020 short. (DC-BERT)
- Distilling Knowledge from Reader to Retriever for Question Answering. Gautier Izacard et al. ICLR 2021.
- Learning Dense Representations of Phrases at Scale. Jinhyuk Lee, Danqi Chen et al. arXiv 2021. [code] (DensePhrases)
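At inference time, DPR-style retrievers reduce open-domain retrieval to maximum inner product search over precomputed passage embeddings (in practice approximated with a library such as FAISS). A minimal exhaustive-search sketch of that step, with illustrative names and toy vectors in place of real encoder outputs:

```python
import numpy as np

def retrieve_topk(query_emb, corpus_embs, k=3):
    """Exhaustive inner-product search over a matrix of passage embeddings:
    the exact counterpart of the approximate nearest-neighbor search used by
    DPR-style retrievers at scale."""
    scores = corpus_embs @ query_emb      # (N,) inner products, one per passage
    topk = np.argsort(-scores)[:k]        # indices of the best-scoring passages
    return topk, scores[topk]

# Toy corpus of three 2-d passage embeddings.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
ids, scores = retrieve_topk(np.array([1.0, 0.0]), corpus, k=2)
```

Because passage embeddings are computed offline once, only the query is encoded at question time, which is what makes the dense first stage fast enough for open-domain QA.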
- Directly apply pre-trained models to IR
- Design new pre-training tasks for reranking
- Modify on top of the existing pre-trained models
- Passage Re-ranking with BERT. Rodrigo Nogueira et al. [code] (monoBERT: perhaps the first work applying BERT to IR)
- Multi-Stage Document Ranking with BERT. Rodrigo Nogueira et al. (duoBERT: pointwise + pairwise)
- Simple Applications of BERT for Ad Hoc Document Retrieval / Applying BERT to Document Retrieval with Birch. Wei Yang, Haotian Zhang et al. / Zeynep Akkalyoncu Yilmaz et al. EMNLP 2019 short. [code] (Birch: sentence-level)
- Deeper Text Understanding for IR with Contextual Neural Language Modeling. Zhuyun Dai et al. SIGIR 2019 short. [code] (BERT-MaxP, BERT-firstP, BERT-sumP: passage-level)
- CEDR: Contextualized Embeddings for Document Ranking. Sean MacAvaney et al. SIGIR 2019 short. [code] (CEDR: BERT + ranking model)
- Training Curricula for Open Domain Answer Re-Ranking. Sean MacAvaney et al. SIGIR 2020. [code] (curriculum learning based on BM25)
- Leveraging Passage-level Cumulative Gain for Document Ranking. Zhijing Wu et al. WWW 2020. (PCGM)
- Selective Weak Supervision for Neural Information Retrieval. Kaitao Zhang et al. WWW 2020. [code] (ReInfoSelect)
- Document Ranking with a Pretrained Sequence-to-Sequence Model. Rodrigo Nogueira, Zhiying Jiang et al. EMNLP 2020. [code] (using T5)
- Beyond [CLS] through Ranking by Generation. Cicero Nogueira dos Santos et al. EMNLP 2020 short. (query likelihood computed by GPT)
- BERT-QE: Contextualized Query Expansion for Document Re-ranking. Zhi Zheng et al. EMNLP 2020 Findings. [code] (BERT-QE)
- Cross-lingual Retrieval for Iterative Self-Supervised Training. Chau Tran et al. NeurIPS 2020. [code] (CRISS)
- A Linguistic Study on Relevance Modeling in Information Retrieval. Yixing Fan, Jiafeng Guo et al. WWW 2021. (Prob & Intervention)
- Generalizing Discriminative Retrieval Models using Generative Tasks. Bingsheng Liu, Hamed Zamani et al. WWW 2021. (GDMTL, joint discriminative and generative model with multi-task learning)
- PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval. Xinyu Ma et al. WSDM 2021. [code] (PROP)
- Local Self-Attention over Long Text for Efficient Document Retrieval. Sebastian Hofstätter et al. SIGIR 2020 short. [code] (TKL: Transformer-Kernel for long text)
- The Cascade Transformer: An Application for Efficient Answer Sentence Selection. Luca Soldaini et al. ACL 2020. [code] (Cascade Transformer)
- Using Prior Knowledge to Guide BERT's Attention in Semantic Textual Matching Tasks. Tingyu Xia et al. WWW 2021. [code] (text matching: guiding BERT's attention)
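The re-ranking works above all sit in the second stage of a retrieve-then-rerank pipeline: a cheap first stage (e.g. BM25) produces candidates, and an expensive model scores each query–document pair jointly. A minimal sketch of that stage; `score_fn` stands in for a cross-encoder such as monoBERT, which feeds the concatenated query and document through BERT and reads a relevance score off the [CLS] representation (the function and the toy scorer below are illustrative):

```python
def rerank(query, candidates, score_fn, k=10):
    """Second-stage re-ranking: score every (query, doc) candidate pair with
    a joint model and return the top-k documents by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def toy_overlap_score(query, doc):
    """Stand-in scorer: word overlap between query and document. A real
    pipeline would call a fine-tuned cross-encoder here."""
    return len(set(query.split()) & set(doc.split()))
```

Because the joint model sees both texts at once, it is far more accurate than first-stage scoring but can only be afforded on the small candidate list, which is exactly the trade-off works like the Cascade Transformer and PreTTR try to soften.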
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. Gen Li, Nan Duan et al. AAAI 2020. [code] (Unicoder-VL)
- XGPT: Cross-modal Generative Pre-Training for Image Captioning. Qiaolin Xia, Haoyang Huang, Nan Duan et al. arXiv 2020. [code] (XGPT)
- UNITER: UNiversal Image-TExt Representation Learning. Yen-Chun Chen, Linjie Li et al. ECCV 2020. [code] (UNITER)
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. Xiujun Li, Xi Yin et al. ECCV 2020. [code] (Oscar)
- VinVL: Making Visual Representations Matter in Vision-Language Models. Pengchuan Zhang, Xiujun Li et al. CVPR 2021. [code] (VinVL)
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Jiasen Lu, Dhruv Batra et al. NeurIPS 2019. [code] (ViLBERT)
- 12-in-1: Multi-Task Vision and Language Representation Learning. Jiasen Lu, Dhruv Batra et al. CVPR 2020. [code] (a multi-task model based on ViLBERT)
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford et al. ICML 2021. [code] (CLIP, OpenAI)
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. Fei Yu, Jiji Tang et al. arXiv 2020. [code] (ERNIE-ViL, 1st place on the VCR leaderboard)
- M6-v0: Vision-and-Language Interaction for Multi-modal Pretraining. Junyang Lin, An Yang et al. KDD 2020. (M6-v0/InterBERT)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. Haoyang Huang, Lin Su et al. CVPR 2021. [code] (M3P, MILD dataset)
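For the dual-encoder family of multimodal models (CLIP being the clearest example), cross-modal retrieval at inference is just cosine similarity between L2-normalized image and text embeddings in a shared space. A toy numpy sketch of the retrieval step, with illustrative names and tiny stand-in embeddings:

```python
import numpy as np

def l2_normalize(x):
    """Project embeddings onto the unit sphere, row-wise."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def text_to_image_retrieval(text_embs, image_embs, k=1):
    """CLIP-style retrieval: after normalization, the inner product between a
    text row and an image row is their cosine similarity; return the indices
    of the top-k images for each text."""
    sim = l2_normalize(text_embs) @ l2_normalize(image_embs).T  # (T, I)
    return np.argsort(-sim, axis=1)[:, :k]
```

Image-to-text retrieval is the same computation with the similarity matrix transposed, which is why a single pre-trained dual encoder covers both directions of the cross-modal retrieval benchmarks these papers evaluate on.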
- Efficient Transformers: A Survey. Yi Tay, Mostafa Dehghani et al. arXiv 2020.