Disparity in BM25 performance

Question

KrishnenduGhosh opened this issue 8 years ago

Hi Tao Lei,

Recently I was trying to develop a Lucene-based BM25 baseline using the AskUbuntu dataset you provided. When building the index with IndexWriter, I indexed title+body for all 167765 questions, and at test time I searched with title+body for each of the 189 queries (11 queries have no similar questions). I set the IndexSearcher similarity to BM25Similarity in Apache Lucene 6.1.0. All other Lucene settings are left at their defaults, apart from the analyzer (EnglishAnalyzer).
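For reference, the scoring that this set-up relies on can be sketched in a few lines of Python. This is only an illustration, not Lucene's implementation: the `tokenize` helper is a hypothetical stand-in for EnglishAnalyzer (no stemming or stop-word removal), the function names are made up for this sketch, and Lucene's lossy encoding of document lengths is ignored. The values k1 = 1.2 and b = 0.75 are the BM25Similarity defaults.

```python
import math
from collections import Counter

K1 = 1.2   # BM25Similarity default
B = 0.75   # BM25Similarity default


def tokenize(text):
    # Hypothetical stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """docs: dict of doc_id -> text (for example title + " " + body)."""
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_lens = {doc_id: sum(tf.values()) for doc_id, tf in term_freqs.items()}
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())  # count each term once per document
    avg_len = sum(doc_lens.values()) / len(doc_lens)
    return term_freqs, doc_lens, doc_freq, avg_len


def bm25_score(query, doc_id, index):
    term_freqs, doc_lens, doc_freq, avg_len = index
    n_docs = len(doc_lens)
    tf = term_freqs[doc_id]
    score = 0.0
    for term in tokenize(query):
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq[term]
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        length_norm = 1 - B + B * doc_lens[doc_id] / avg_len
        score += idf * f * (K1 + 1) / (f + K1 * length_norm)
    return score


def rank(query, index):
    # Return all doc ids, best BM25 score first.
    doc_ids = index[1]
    return sorted(doc_ids, key=lambda d: bm25_score(query, d, index), reverse=True)
```

With a toy corpus, `rank("wifi not working", build_index(docs))` returns the ids ordered by score. Differences in tokenization and stemming alone can shift these scores noticeably, so the analyzer choice is worth checking when comparing against a reported baseline.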

But the problem is that I am getting a MAP of around 0.11, which is nowhere near the BM25 performance you reported. So I suspect I am missing some steps somewhere. Could you please help me with this issue?
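One thing that may be worth checking is how MAP itself is computed, since conventions differ. Below is a minimal Python sketch, under stated assumptions: average precision is divided by the total number of relevant questions, and the `skip_empty` flag (a hypothetical name) controls whether queries with no similar questions are dropped or counted as 0. Neither choice is confirmed to match the evaluation used for the reported numbers.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query; returns None when the query has no relevant items."""
    relevant = set(relevant_ids)
    if not relevant:
        return None
    hits = 0
    precision_sum = 0.0
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    # Assumption: normalize by all relevant items, retrieved or not.
    return precision_sum / len(relevant)


def mean_average_precision(runs, skip_empty=True):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    scores = []
    for ranked_ids, relevant_ids in runs:
        ap = average_precision(ranked_ids, relevant_ids)
        if ap is None:
            if skip_empty:
                continue  # drop queries that have no similar questions
            ap = 0.0      # or count them as a complete miss
        scores.append(ap)
    return sum(scores) / len(scores) if scores else 0.0
```

Running both settings on the same ranked lists shows how much of a gap comes from the treatment of the queries without similar questions rather than from retrieval quality.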

Tao Lei · Answer 1 · Sat Sep 03 2016 14:25:18 GMT+0800 (China Standard Time)

Hi @KrishnenduGhosh

Rish (the second author) worked on the BM25 baseline for this project. I remember that he spent quite a bit of time tuning the BM25 baseline and its preprocessing.

Could you email Rish (hrishjoshi2@gmail.com and hjoshi@mit.edu) for more information about the Lucene set-up? Sorry about the inconvenience.

Tao Lei · Answer 2 · Sat Sep 03 2016 14:27:31 GMT+0800 (China Standard Time)

@KrishnenduGhosh I could also email him on your behalf. Just let me know your email address.

Krishnendu Ghosh · Answer 3 · Sat Sep 03 2016 15:38:49 GMT+0800 (China Standard Time)

@taolei87 My email address is: kghosh.cs@gmail.com / kghosh.cs@iitkgp.ac.in