dmh43 / pyterrier_ance

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PyTerrier_ANCE

This is the PyTerrier plugin for the ANCE dense passage retriever.

Installation

This repostory can be installed using Pip.

pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

You will need FAISS (cpu or gpu) installed:

On Colab:

!pip install faiss-cpu 

On Anaconda:

# CPU-only version
$ conda install -c pytorch faiss-cpu

# GPU(+CPU) version
$ conda install -c pytorch faiss-gpu

For ANCE, the CPU version is sufficient.

Indexing

You will need a pre-trained ANCE checkpoint. There are several available from the ANCE repository.

Then, indexing is as easy as instantiating the indexer, pointing at the (unzipped) checkpoint and the directory in which you wish to create an index

dataset = pt.get_dataset("irds:vaswani")
import pyterrier_ance
indexer = pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())

Retrieval

You can instantiate the retrieval transformer, again by specifying the checkpoint location and the index location:

anceretr = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex")

Thereafter, you can use it in the normal PyTerrier way, for instance in an experiment:

pt.Experiment(
    [anceretr], 
    dataset.get_topics(), 
    dataset.get_qrels(), 
    eval_metrics=["map"]
)

You can also use ANCE as a re-ranker to score text (e.g., as a re-ranker) using ANCETextScorer.

ance_text_scorer = pyterrier_ance.ANCETextScorer("/path/to/checkpoint")
# You'll need to use this in a retrieval pipeline that includes the document text, e.g.:
# bm25 >> pt.text.get_text(dataset, 'text') >> ance_text_scorer

Documents longer than Passages

If your documents are longer than passages, you should apply passaging to them before indexing, and max passage (say) during retrieval:

# indexing
dataset = pt.get_dataset("irds:vaswani")
import pyterrier_ance
indexer = pt.text.sliding("text", prepend_attr=None) >> pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())

# retrieval 

ance_maxp = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex") >> pt.text.max_passage()

Examples

Checkout out the notebooks, even on Colab:

The Terrier data repository contains ANCE indices for several corpora, including Vaswani and MSMARCO Passage v1.

Implementation Details

We use a fork-ed copy of ANCE that makes it pip installable, and addresses other quibbles.

References

  • [Xiong20] Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk. https://arxiv.org/pdf/2007.00808.pdf
  • [Macdonald20]: Craig Macdonald, Nicola Tonellotto. Declarative Experimentation inInformation Retrieval using PyTerrier. Craig Macdonald and Nicola Tonellotto. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271

Credits

  • Craig Macdonald, University of Glasgow
  • Nicola Tonellotto, University of Pisa
  • Sean MacAvaney, University of Glasgow
  • Dany Haddad, University of Texas

About


Languages

Language:Jupyter Notebook 88.0%Language:Python 12.0%