More accurate BM25 similarity for Lucene

Improving the effectiveness of Lucene's BM25 (and testing it using community QA and ClueWeb* collections). Please see my blog post (http://searchivarius.org/blog/accurate-bm25-similarity-lucene) for details. The tests were run using an early version of Lucene 6.x (namely, 6.0). I have recently upgraded the code to work with Lucene 8.0, but did not fully retest it.

Main Prerequisites

  1. Data:
     - The Yahoo! Answers data set needs to be obtained from Yahoo! Webscope;
     - The Stack Overflow data set can be freely downloaded: we need only the posts. It subsequently needs to be converted to the Yahoo! Answers format using the script scripts/convert_stack_overflow.sh. The converted collection that I used is also available here. Note that I converted the data without including any Stack Overflow code (excluding the code makes the retrieval task harder).
     - ClueWeb09 & ClueWeb12: I use Category B only, which is a subset containing about 50 million documents. Unfortunately, these collections aren't freely available for download. For details on obtaining access to these collections, please refer to the official documents: ClueWeb09, ClueWeb12.
  2. You need Java 7 and Maven (see the build sketch after this list);
  3. To carry out evaluations, you need R, Python, and Perl. Should you decide to use the old-style evaluation scripts (not enabled by default), you will also need a C compiler. The evaluation script will download and compile the TREC trec_eval evaluation utility on its own.
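
A standard Maven build should suffice (a minimal sketch; I have not verified repository-specific build profiles, and the indexing/querying wrapper scripts may rebuild the code for you anyway):

mvn clean package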

Indexing

The low-level indexing script is scripts/lucene_index.sh. I have also implemented a wrapper script, scripts/create_indices.sh, which I recommend using instead. To create indices and auxiliary files in the subdirectories exper/compr (for Yahoo! Answers Comprehensive) and exper/stack (for Stack Overflow), I used the following commands (you will need to specify the locations of input/output files on your own computer):

scripts/create_indices.sh ~/TextCollect/StackOverflow/PostsNoCode2016-04-28.xml.bz2 exper/stack yahoo_answers
scripts/create_indices.sh ~/TextCollect/YahooAnswers/Comprehensive/FullOct2007.xml.bz2 exper/compr yahoo_answers

Note the last argument: it specifies the type of the input data. In the case of ClueWeb09 and ClueWeb12, the data set type is clueweb. A sample indexing command (additional quotes are needed because the input data set directory has a space in its full path):

scripts/create_indices.sh "\"/media/leo/Seagate Expansion Drive/ClueWeb12_B13/\"" exper/clueweb12 clueweb

Again, don't forget that you have to specify the location of input/output files on your computer!

Expert indexing

To see the indexing options of the low-level indexing script, type:

scripts/lucene_index.sh

In addition to an input file (which can be gzip- or bzip2-compressed), you have to specify an output directory to store the Lucene index. For community QA data, you can also specify the location of an output file to store TREC-style QRELs.
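
For illustration only, a hypothetical invocation is sketched below; the flag names merely mirror those of lucene_query.sh (documented further down) and are assumptions, so rely on the usage message printed by the script itself:

scripts/lucene_index.sh -i <input-file> -d <index-output-dir> -source_type yahoo_answers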

Testing with community QA data sets

The low-level querying script is scripts/lucene_query.sh, but I strongly recommend using the wrapper scripts/run_eval_queries.sh, which does almost all the evaluation work (except extracting average retrieval times). The following is an example of invoking the evaluation script:

scripts/run_eval_queries.sh ~/TextCollect/YahooAnswers/Comprehensive/FullOct2007.xml.bz2 yahoo_answers exper/compr/ 10000 5 1

Here we ask to use the first 10000 questions. The search series is repeated 5 times. The value of the last argument tells the script to evaluate effectiveness as well as to compute p-values; again, you need R, Perl, and Python for this. Alternatively, you can hack the evaluation script and set the variable USE_OLD_STYLE_EVAL_FOR_YAHOO_ANSWERS to 1; in this case, you will also need a C compiler.
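
If you want to flip that switch, a minimal sketch for locating it first (I am assuming the variable is defined in one of the shell scripts, so verify with grep before editing by hand):

grep -rn USE_OLD_STYLE_EVAL_FOR_YAHOO_ANSWERS scripts/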

Note 1: the second argument is the type of the data source. Use yahoo_answers for community QA collections. For ClueWeb09 and ClueWeb12, use trec_web.

Note 2: the script will not re-run queries if output files already exist! To re-run queries, you need to manually delete the files named trec_run from the respective subdirectories.
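
For example, assuming your top-level experimental directory is exper/compr as in the examples above (a sketch; adjust the path to your own setup):

find exper/compr -name trec_run -delete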

The average retrieval times are saved to a log file. They can be extracted as follows:

grep 'on average' exper/compr/standard/query.log

To retrieve the list of timings for every run as a space-separated sequence, you can do the following:

grep 'on average' exper/compr/standard/query.log |awk '{printf("%s%s",t,$7);t=" "}END{print ""}'

Note that exper/compr in these examples should be replaced with your own top-level directory that you pass to the script scripts/create_indices.sh.

Testing with ClueWeb09/12 data sets

Again, please use the wrapper script scripts/run_eval_queries.sh (which calls the low-level scripts/lucene_query.sh). However, an additional argument should be specified: the relevance judgements (the so-called QREL file). For ClueWeb09 data, I have placed both the judgements (QREL file) and the queries in this repo. Therefore, one can run evaluations as follows (don't forget to specify your own directory with ClueWeb09/12 indices instead of exper/clueweb09):

scripts/run_eval_queries.sh eval_data/clueweb09_1MQ/queries_1MQ.txt trec_web exper/clueweb09 10000 1 1 eval_data/clueweb09_1MQ/qrels_1MQ.txt

For the ClueWeb12 data set, query and QREL files need to be generated. To do this, you first need to download and uncompress the UQV100 files. Then, you can generate the QREL and query files, e.g., as follows:

eval_data/uqv100/merge_uqv100.py  ~/TextCollect/uqv100/ eval_data/uqv100/uqv100_mult_queries.txt eval_data/uqv100/uqv100_mult_qrels.txt

Finally, you can use the generated QREL and query files to run the evaluation:

scripts/run_eval_queries.sh eval_data/uqv100/uqv100_mult_queries.txt trec_web exper/clueweb12/ 10000 1 1 eval_data/uqv100/uqv100_mult_qrels.txt 

Expert querying

To see the options of the low-level querying script, type:

scripts/lucene_query.sh

It is possible to evaluate all questions, to randomly select a subset of questions, or to limit the number of queries to execute. A sample invocation of lucene_query.sh:

scripts/lucene_query.sh -d ~/lucene/yahoo_answers_baseline/ -i /home/leo/TextCollect/YahooAnswers/Manner/manner-v2.0/manner.xml.bz2 -source_type "yahoo_answers" -n 15 -o eval/out -prob 0.01 -bm25_k1 0.6 -bm25_b 0.25 -max_query_qty 10000 -s data/stopwords.txt

Note the stopword file!

Effectiveness can be evaluated using the above-mentioned trec_eval utility and the utility gdeval.pl located in the directory scripts. To this end, you need the QREL files produced during indexing.
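
For example (a minimal sketch: the placeholder file names are assumptions, and I assume the compiled trec_eval binary is on your PATH; substitute the QREL file produced during indexing and the run file written by the querying script):

trec_eval <qrel-file> <run-file>
perl scripts/gdeval.pl <qrel-file> <run-file>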

We use the BM25 similarity function. The default parameter values are k1=1.2 and b=0.75; different values can be specified via the parameters bm25_k1 and bm25_b.
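
For instance, the sample invocation above, modified so that the stock defaults are passed explicitly (only the two BM25 flags differ; all other arguments are unchanged from the example):

scripts/lucene_query.sh -d ~/lucene/yahoo_answers_baseline/ -i /home/leo/TextCollect/YahooAnswers/Manner/manner-v2.0/manner.xml.bz2 -source_type "yahoo_answers" -n 15 -o eval/out -prob 0.01 -bm25_k1 1.2 -bm25_b 0.75 -max_query_qty 10000 -s data/stopwords.txt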

Note on using Stanford NLP

One can use Stanford NLP to tokenize and lemmatize the input. To activate the Stanford NLP lemmatizer, set the value of the constant UtilConst.USE_STANFORD in the file UtilConst.java to true. The code will be recompiled automatically if you use our scripts to index/query. This doesn't seem to improve effectiveness, though, but processing takes much longer.
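
To locate the constant before editing it (a sketch; the exact path of UtilConst.java within the source tree is not spelled out here, so find it first):

find . -name UtilConst.java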
