terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home page: https://pyterrier.readthedocs.io/


RM3 does not add additional terms to the query for very small corpora

mam10eks opened this issue · comments

Dear all,

Thank you very much for your efforts with PyTerrier, it is a really awesome framework!
I currently want to create small hello-world examples to showcase some concepts for our IR course, and RM3 does not behave as I would expect in my example: it does not expand the query (maybe I used it wrong, or there are caveats because my example is so small).
This is not urgent, as it works with Bo1QueryExpansion and KLQueryExpansion.

Describe the bug
RM3 does not add additional terms to the query in the hello world example described below.
E.g., for the same relevance feedback document, RM3 expands the query dog to applypipeline:off dog^0.600000024, whereas Bo1 expands the query dog to applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646.

To Reproduce

I have prepared a Colab notebook that showcases the problem.

Steps to reproduce the behavior:

  1. Which index
import pandas as pd
import pyterrier as pt

documents = [
    {'docno': 'd1', 'text': 'The Golden Retriever is a Scottish breed of medium size.'},
    {'docno': 'd2', 'text': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.'},
    {'docno': 'd3', 'text': 'Poodles are a highly intelligent, energetic, and sociable.'},
    {'docno': 'd4', 'text': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.'},
    {'docno': 'd5', 'text': 'The domestic canary is a small songbird.'}
]

if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True)
index_ref = indexer.index(documents)
index = pt.IndexFactory.of(index_ref)
  2. Which topics
topic = pd.DataFrame([{'qid': '1', 'query': 'dog'}])
  3. Which pipeline
explicit_relevance_feedback = pt.Transformer.from_df(pd.DataFrame([{'query': 'dog', 'qid': '1', 'docno': 'd2'}]))

bo1_expansion = explicit_relevance_feedback >> pt.rewrite.Bo1QueryExpansion(index)
rm3_expansion = explicit_relevance_feedback >> pt.rewrite.RM3(index)
  4. Output

(Screenshot: Screenshot_20231022_104422)
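One way to see the difference programmatically (a small helper of my own, not a PyTerrier API) is to parse the rewritten query strings from the output into term–weight dictionaries:

```python
def parse_rewritten_query(query):
    """Parse a Terrier rewritten query string such as
    'applypipeline:off dog^2.000000000 colli^1.000000000 ...'
    into a dict mapping each term to its weight."""
    weights = {}
    for token in query.split():
        if '^' in token:
            term, weight = token.rsplit('^', 1)
            weights[term] = float(weight)
    return weights

rm3_query = 'applypipeline:off dog^0.600000024'
bo1_query = ('applypipeline:off dog^2.000000000 colli^1.000000000 '
             '3^1.000000000 border^1.000000000 shepherd^1.000000000 '
             '1^1.000000000 type^1.000000000 german^1.000000000 '
             '2^1.000000000 poodl^0.805050646')

# RM3 keeps only the original query term; Bo1 adds nine expansion terms.
print(sorted(parse_rewritten_query(rm3_query)))
print(sorted(parse_rewritten_query(bo1_query)))
```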

Expected behavior

I would expect that RM3 also adds terms like colli, shepherd, and poodl to the query.


Thanks in advance!

Best regards,

Maik

Hi @mam10eks. Thanks for the report. I had time today to figure this out.

Firstly, adding pt.logging('INFO') makes RM3 a little more verbose. It wasn't quite verbose enough for me, so I added some more logging.

With that, I got the output:

17:14:05.015 [main] WARN org.terrier.querying.RM1 - Did not identify any usable candidate expansion terms from docid 1
17:14:05.020 [main] INFO org.terrier.querying.RM1 - Found 0 terms after feedback document analysis

If I add the following properties:

pt.set_property("prf.mindf", "1")
pt.set_property("prf.maxdp", "1")

I get:

17:14:45.730 [main] INFO org.terrier.querying.RM1 - Analysing 1 feedback documents
17:14:45.749 [main] INFO org.terrier.querying.RM1 - Found 11 terms after feedback document analysis
17:14:45.755 [main] INFO org.terrier.querying.RM3 - Reformulated query q100 @ lambda=0.6: intellig^0.03999999538064003 german^0.03999999538064003 colli^0.03999999538064003 2^0.03999999538064003 1^0.03999999538064003 border^0.03999999538064003 dog^0.64000004529953 3^0.03999999538064003 type^0.03999999538064003 poodl^0.03999999538064003
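The reported weights are consistent with RM3's usual interpolation, w(t) = λ·P(t|q) + (1−λ)·P_RM1(t): with λ = 0.6 and (judging by the symmetry of the output) a uniform RM1 distribution over the 10 feedback terms, a quick sanity check (my own arithmetic, not Terrier code):

```python
lam = 0.6           # the lambda reported in the RM3 log line
p_rm1 = 1 / 10      # RM1 weight, assumed uniform over the 10 feedback terms
p_orig = 1.0        # original query probability of 'dog' (single-term query)

# 'dog' appears in both the original query and the feedback terms
dog_weight = lam * p_orig + (1 - lam) * p_rm1        # matches dog^0.64000004...
# pure expansion terms get only the RM1 contribution
expansion_weight = lam * 0.0 + (1 - lam) * p_rm1     # matches e.g. colli^0.03999999...

print(dog_weight, expansion_weight)
```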

I think the overall lesson here is that sometimes small corpora do not exhibit the necessary statistics that tuned models expect.
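The mindf cut-off explains this well: in a 5-document corpus, almost every content term of d2 occurs in only one document, so any minimum-document-frequency filter above 1 discards nearly all candidates. A standalone sketch in plain Python, with a threshold of 2 chosen purely for illustration (check the terrier-prf source for the actual default):

```python
import re
from collections import Counter

documents = {
    'd1': 'The Golden Retriever is a Scottish breed of medium size.',
    'd2': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.',
    'd3': 'Poodles are a highly intelligent, energetic, and sociable.',
    'd4': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.',
    'd5': 'The domestic canary is a small songbird.',
}

# Document frequency: in how many documents does each term occur?
tokenized = {d: set(re.findall(r'\w+', text.lower())) for d, text in documents.items()}
df = Counter()
for terms in tokenized.values():
    df.update(terms)

MIN_DF = 2  # illustrative threshold: a feedback term must occur in >= MIN_DF documents
candidates = {t for t in tokenized['d2'] if df[t] >= MIN_DF}
print(sorted(candidates))  # only the few d2 terms that also appear elsewhere survive
```

With this toy filter, terms like collies, border, german, and shepherds (each occurring only in d2) are all discarded before expansion.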

My dev notebook is at:
https://colab.research.google.com/drive/1BKLNZUTc9DidgHXM8L4o7vPXNEW9PpOO?usp=sharing

HTH

On a related point, RM3 (unlike Bo1) is intended to examine the scores of the retrieved documents. I don't think the score is appropriately passed to RM3 by PyTerrier:

See https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L223 and https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L239. Would you be able to verify, and see the impact on Robust04 performance when fixing this?
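For reference, on the input side this would presumably just mean carrying a score column (and a rank, as Terrier transformers usually do) in the feedback frame; the actual fix would be inside rewrite.py. A hedged sketch of the expected input shape only, with a made-up score value:

```python
import pandas as pd

# Explicit relevance feedback with a (hypothetical) retrieval score attached;
# RM3 weights feedback documents by their score, unlike Bo1.
feedback = pd.DataFrame([
    {'qid': '1', 'query': 'dog', 'docno': 'd2', 'score': 1.0, 'rank': 0},
])
print(feedback.columns.tolist())
```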

Dear Craig,

Thanks for your response!
Indeed, I was not aware of the prf.mindf and the prf.maxdp properties, and setting them resolves the problem in my tiny hello world example.

I will have a look into forwarding the retrieval scores to RM3 and will report back the impact on Robust04.

Best regards,

Maik

Hi, I have not yet started with this, but at the beginning of next week I will have time to look into it :)

@mam10eks any news? I'd like to fix this for the next PyTerrier release.

Dear @cmacdonald sorry again for the delay!

I have the upcoming CLEF deadline on my agenda until around Friday, which means I could have first results here on Monday or Tuesday. Would this still suffice?

Best regards,

Maik