terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home Page: https://pyterrier.readthedocs.io/

Trying to tune BM25F so that it behaves exactly as BM25

Sondeluz opened this issue

Hello,

I am trying to tune the w. and c. controls of BM25F in PyTerrier so that it behaves exactly like a plain BM25 query. I am doing this because I noticed that, compared to an identical index in Elastic, BM25 gave similar results on my test dataset, while BM25F performed far worse than BM25.

My documents consist of three fields (bucket0, bucket1, bucket2). For BM25, I instantiate the retriever like this:

retriever = pt.BatchRetrieve(index, wmodel="BM25").parallel(multiprocessing.cpu_count())

And with BM25F:

retriever = pt.BatchRetrieve(index, wmodel='BM25F', controls=controls).parallel(multiprocessing.cpu_count())

with controls being a dict containing:

controls = {'w.0': 1.0, 'c.0': 1.0, 'w.1': 1.0, 'c.1': 1.0, 'w.2': 1.0, 'c.2': 1.0}
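
For indexes with a different number of fields, the same dict can be built programmatically rather than typed out. This is a minimal sketch: the helper name make_bm25f_controls is hypothetical and not part of PyTerrier, and it only reproduces the 'w.N' / 'c.N' key pattern shown above.

```python
def make_bm25f_controls(num_fields, weight=1.0, norm=1.0):
    """Build a BM25F controls dict with the same weight ('w.N') and
    normalisation ('c.N') value for every field.

    Hypothetical helper: it only mirrors the key pattern used in the
    controls dict from the question.
    """
    controls = {}
    for i in range(num_fields):
        controls[f"w.{i}"] = weight
        controls[f"c.{i}"] = norm
    return controls


# Three fields (bucket0, bucket1, bucket2), all weights and c values set to 1.0.
controls = make_bm25f_controls(3)
```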

Shouldn't the BM25F query have the same behavior as the BM25 query if I use identical weights and normalization parameters?

I have ensured that:

  • I instantiate the same pipe regardless of the retriever I use:
retriever = self.get_retriever(json_data, search_type)
pipe = pt.rewrite.tokenise(pt.index.TerrierTokeniser.utf) >> retriever

queryDf = pd.DataFrame(queries, columns=["qid", "query"])
results = pipe.transform(queryDf)
  • The fields are detected correctly:
index.getCollectionStatistics().getFieldNames(): ['bucket0', 'bucket1', 'bucket2']

I am unsure whether this is a Terrier bug or a configuration issue on my part (most likely!).

Shouldn't the BM25F query have the same behavior as the BM25 query if I use identical weights and normalization parameters?

No, I don't think so. Per-field normalisation means that the term frequencies are normalised WITHIN each field, using that field's own length, before they are combined. That means with multiple fields, you cannot recover the original BM25 behaviour from BM25F.
See also Equation 3.20 in https://www.staff.city.ac.uk/~sbrp622/papers/foundations_bm25_review.pdf
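
To make the point concrete, here is a small numeric sketch of the two length normalisations. It uses the textbook formulas (BM25 divides the total term frequency by 1 - b + b * dl/avgdl; per-field BM25F divides each field's term frequency by 1 - b_f + b_f * l_f/avgl_f and then sums the weighted results), not Terrier's actual code. The term frequencies, field lengths and the value b = 0.75 are made up for illustration, and treating Terrier's c.N controls as the per-field b_f is an assumption.

```python
def length_norm(length, avg_length, b):
    """Textbook BM25 length-normalisation factor: (1 - b) + b * (len / avg_len)."""
    return (1.0 - b) + b * (length / avg_length)


def bm25_pseudo_tf(field_tfs, field_lens, avg_field_lens, b):
    """Plain BM25: ignore field boundaries, normalise the total tf
    by the whole-document length."""
    total_tf = sum(field_tfs)
    doc_len = sum(field_lens)
    avg_doc_len = sum(avg_field_lens)  # mean document length = sum of mean field lengths
    return total_tf / length_norm(doc_len, avg_doc_len, b)


def bm25f_pseudo_tf(field_tfs, field_lens, avg_field_lens, weights, bs):
    """Per-field BM25F: normalise each field's tf by that field's length,
    then combine with the field weights."""
    return sum(
        w * tf / length_norm(length, avg, b_f)
        for tf, length, avg, w, b_f in zip(field_tfs, field_lens, avg_field_lens, weights, bs)
    )


# Made-up document with three fields (think bucket0, bucket1, bucket2).
field_tfs = [2, 0, 1]            # occurrences of the query term in each field
field_lens = [5, 50, 100]        # length of each field in this document
avg_field_lens = [10, 40, 200]   # collection-wide average length of each field

b = 0.75  # illustrative value only
bm25 = bm25_pseudo_tf(field_tfs, field_lens, avg_field_lens, b)
bm25f = bm25f_pseudo_tf(field_tfs, field_lens, avg_field_lens,
                        weights=[1.0, 1.0, 1.0], bs=[b, b, b])

print(bm25)   # about 4.196
print(bm25f)  # about 4.8
```

Even with identical weights and the same b in every field, the two normalised frequencies differ, because each field is compared against its own average length rather than against the average document length. Both models then pass this value through the same kind of saturation function, so the final scores, and potentially the rankings, differ as well.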