Dense Hybrid Retrieval

In this repo, we introduce two approaches to training transformers to capture semantic and lexical text representations for robust dense passage retrieval.

Aggretriever: A Simple Approach to Aggregate Textual Representation for Robust Dense Passage Retrieval. Sheng-Chieh Lin, Minghan Li, and Jimmy Lin.
A Dense Representation Framework for Lexical and Semantic Matching. Sheng-Chieh Lin and Jimmy Lin.

This repo contains three parts: (1) densify, (2) training (tevatron), and (3) retrieval. Our training code is taken mainly from Tevatron, with minor revisions.

Requirements

pip install "torch>=1.7.0"
pip install transformers==4.15.0
pip install pyserini
pip install beir

Aggretriever

In this paper, we introduce a simple approach to aggregating token-level information into a single-vector dense representation. We provide instructions for model training and evaluation on the MS MARCO passage ranking dataset in the document. We also provide instructions for evaluation on the BEIR datasets in the document.

A Dense Representation Framework for Lexical and Semantic Matching

In this paper, we introduce a unified representation framework for lexical and semantic matching. We first show how to use our framework to conduct retrieval with high-dimensional (lexical) representations, and then how to combine them with single-vector dense (semantic) representations for hybrid search.
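
To make the idea of hybrid search concrete, here is a minimal sketch of one common way to combine lexical and semantic scores: a weighted linear sum per passage. This is an illustration only, not the scoring used in this repo. The function name `fuse_scores`, the `weight` parameter, and the choice to treat a passage missing from one result list as scoring 0 are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    lexical: Dict[str, float],
    semantic: Dict[str, float],
    weight: float = 1.0,
) -> List[Tuple[str, float]]:
    """Rank passages by lexical score + weight * semantic score.

    Assumption: a passage absent from one result list contributes 0
    for that component.
    """
    passage_ids = set(lexical) | set(semantic)
    fused = {
        pid: lexical.get(pid, 0.0) + weight * semantic.get(pid, 0.0)
        for pid in passage_ids
    }
    # Sort by descending score; break ties by passage id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for three passages.
lexical_hits = {"p1": 6.5, "p2": 1.0}
semantic_hits = {"p2": 4.0, "p3": 3.0}
ranking = fuse_scores(lexical_hits, semantic_hits, weight=2.0)
```

In practice the weight would be tuned on a development set, since lexical and semantic scores usually live on different scales.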

Dense Lexical Retrieval

We can densify any existing lexical matching model and conduct lexical matching on GPU. In the document, we demonstrate how to conduct end-to-end BM25 and uniCOIL retrieval under our framework. A detailed description can be found in our paper.
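
As a rough illustration of what densifying a lexical vector can look like, the sketch below compresses a vocabulary-sized weight vector into a fixed number of slices, keeping only the maximum weight and its position within each slice, and then scores a query against a passage by multiplying slice values only where the stored positions agree. This is a simplified pure-Python sketch under stated assumptions (contiguous slices, non-negative weights). The names `densify` and `gated_inner_product` are hypothetical and do not refer to this repo's code. See the paper for the actual formulation.

```python
from typing import List, Sequence, Tuple

DenseLexical = Tuple[List[float], List[int]]


def densify(sparse: Sequence[float], num_slices: int) -> DenseLexical:
    """Compress a vocabulary-sized vector into per-slice (max value, index) pairs.

    Assumption: the vocabulary is split into contiguous, equal-width slices.
    """
    if num_slices <= 0 or len(sparse) % num_slices != 0:
        raise ValueError("vector length must be a positive multiple of num_slices")
    width = len(sparse) // num_slices
    values: List[float] = []
    indices: List[int] = []
    for s in range(num_slices):
        chunk = sparse[s * width:(s + 1) * width]
        # Position of the largest weight inside this slice (first one on ties).
        best = max(range(width), key=lambda i: chunk[i])
        values.append(chunk[best])
        indices.append(best)
    return values, indices


def gated_inner_product(query: DenseLexical, passage: DenseLexical) -> float:
    """Sum value products over slices whose stored indices match."""
    q_val, q_idx = query
    p_val, p_idx = passage
    return sum(
        qv * pv
        for qv, qi, pv, pi in zip(q_val, q_idx, p_val, p_idx)
        if qi == pi
    )


# Toy vocabulary of 6 terms, compressed into 3 slices of width 2.
query_weights = [0, 2, 0, 0, 1, 0]
passage_weights = [0, 3, 0, 1, 0.5, 0]
score = gated_inner_product(densify(query_weights, 3), densify(passage_weights, 3))
```

In this toy case the densified score equals the exact sparse dot product (6.5), because no two matching terms fall into the same slice. In general the compression is lossy, and more slices give a closer approximation.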