terrier-org / pyterrier

A Python framework for performing information retrieval experiments, building on http://terrier.org/

Home page: https://pyterrier.readthedocs.io/


RM3 does not add additional terms to the query for very small corpora

mam10eks opened this issue · comments

Dear all,

Thank you very much for your efforts with PyTerrier, it is a really awesome framework!
I currently want to create small hello-world examples to showcase some concepts for our IR course, and RM3 does not behave as I would expect in my example: it does not expand the query (maybe I used it wrong, or there are caveats because my example is so small).
This is not urgent, as it works with Bo1QueryExpansion and KLQueryExpansion.

Describe the bug
RM3 does not add additional terms to the query in the hello world example described below.
E.g., for the same relevance feedback document, RM3 expands the query dog to applypipeline:off dog^0.600000024, whereas Bo1 expands the query dog to applypipeline:off dog^2.000000000 colli^1.000000000 3^1.000000000 border^1.000000000 shepherd^1.000000000 1^1.000000000 type^1.000000000 german^1.000000000 2^1.000000000 poodl^0.805050646.

To Reproduce

I have prepared a Colab notebook that showcases the problem.

Steps to reproduce the behavior:

  1. Which index
import pandas as pd
import pyterrier as pt

documents = [
    {'docno': 'd1', 'text': 'The Golden Retriever is a Scottish breed of medium size.'},
    {'docno': 'd2', 'text': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.'},
    {'docno': 'd3', 'text': 'Poodles are a highly intelligent, energetic, and sociable.'},
    {'docno': 'd4', 'text': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.'},
    {'docno': 'd5', 'text': 'The domestic canary is a small songbird.'}
]

if not pt.started():
    pt.init(boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"])

indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, blocks=True)
index_ref = indexer.index(documents)
index = pt.IndexFactory.of(index_ref)
  2. Which topics
topic = pd.DataFrame([{'qid': '1', 'query': 'dog'}])
  3. Which pipeline
explicit_relevance_feedback = pt.Transformer.from_df(pd.DataFrame([{'query': 'dog', 'qid': '1', 'docno': 'd2'}]))

bo1_expansion = explicit_relevance_feedback >> pt.rewrite.Bo1QueryExpansion(index)
rm3_expansion = explicit_relevance_feedback >> pt.rewrite.RM3(index)
  4. Output

(Screenshot: Screenshot_20231022_104422)
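One way to see the difference programmatically (a small helper of my own, not a PyTerrier API) is to parse the rewritten query strings from the output into term–weight dictionaries:

```python
def parse_rewritten_query(query):
    """Parse a Terrier rewritten query string such as
    'applypipeline:off dog^2.000000000 colli^1.000000000 ...'
    into a dict mapping each term to its weight."""
    weights = {}
    for token in query.split():
        if '^' in token:
            term, weight = token.rsplit('^', 1)
            weights[term] = float(weight)
    return weights

rm3_query = 'applypipeline:off dog^0.600000024'
bo1_query = ('applypipeline:off dog^2.000000000 colli^1.000000000 '
             '3^1.000000000 border^1.000000000 shepherd^1.000000000 '
             '1^1.000000000 type^1.000000000 german^1.000000000 '
             '2^1.000000000 poodl^0.805050646')

# RM3 keeps only the original query term; Bo1 adds nine expansion terms.
print(sorted(parse_rewritten_query(rm3_query)))
print(sorted(parse_rewritten_query(bo1_query)))
```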

Expected behavior

I would expect that RM3 also adds terms like colli, shepherd, and poodl to the query.


Thanks in advance!

Best regards,

Maik

Hi @mam10eks. Thanks for the report. I had time today to figure this out.

Firstly, adding pt.logging('INFO') makes RM3 a little more verbose. It wasn't quite verbose enough for me, so I added some more logging.

With that, I got the output:

17:14:05.015 [main] WARN org.terrier.querying.RM1 - Did not identify any usable candidate expansion terms from docid 1
17:14:05.020 [main] INFO org.terrier.querying.RM1 - Found 0 terms after feedback document analysis

If I add the following properties:

pt.set_property("prf.mindf", "1")
pt.set_property("prf.maxdp", "1")

I get:

17:14:45.730 [main] INFO org.terrier.querying.RM1 - Analysing 1 feedback documents
17:14:45.749 [main] INFO org.terrier.querying.RM1 - Found 11 terms after feedback document analysis
17:14:45.755 [main] INFO org.terrier.querying.RM3 - Reformulated query q100 @ lambda=0.6: intellig^0.03999999538064003 german^0.03999999538064003 colli^0.03999999538064003 2^0.03999999538064003 1^0.03999999538064003 border^0.03999999538064003 dog^0.64000004529953 3^0.03999999538064003 type^0.03999999538064003 poodl^0.03999999538064003
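The reported weights are consistent with RM3's usual interpolation, w(t) = λ·P(t|q) + (1−λ)·P_RM1(t): with λ = 0.6 and (judging by the symmetry of the output) a uniform RM1 distribution over the 10 feedback terms, a quick sanity check (my own arithmetic, not Terrier code):

```python
lam = 0.6           # the lambda reported in the RM3 log line
p_rm1 = 1 / 10      # RM1 weight, assumed uniform over the 10 feedback terms
p_orig = 1.0        # original query probability of 'dog' (single-term query)

# 'dog' appears in both the original query and the feedback terms
dog_weight = lam * p_orig + (1 - lam) * p_rm1        # matches dog^0.64000004...
# pure expansion terms get only the RM1 contribution
expansion_weight = lam * 0.0 + (1 - lam) * p_rm1     # matches e.g. colli^0.03999999...

print(dog_weight, expansion_weight)
```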

I think the overall lesson here is that sometimes small corpora do not exhibit the necessary statistics that tuned models expect.
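The mindf cut-off explains this well: in a 5-document corpus, almost every content term of d2 occurs in only one document, so any minimum-document-frequency filter above 1 discards nearly all candidates. A standalone sketch in plain Python, with a threshold of 2 chosen purely for illustration (check the terrier-prf source for the actual default):

```python
import re
from collections import Counter

documents = {
    'd1': 'The Golden Retriever is a Scottish breed of medium size.',
    'd2': 'Intelligent types of dogs are: (1) Border Collies, (2) Poodles, and (3) German Shepherds.',
    'd3': 'Poodles are a highly intelligent, energetic, and sociable.',
    'd4': 'The European Shorthair is medium-sized to large cat with a well-muscled chest.',
    'd5': 'The domestic canary is a small songbird.',
}

# Document frequency: in how many documents does each term occur?
tokenized = {d: set(re.findall(r'\w+', text.lower())) for d, text in documents.items()}
df = Counter()
for terms in tokenized.values():
    df.update(terms)

MIN_DF = 2  # illustrative threshold: a feedback term must occur in >= MIN_DF documents
candidates = {t for t in tokenized['d2'] if df[t] >= MIN_DF}
print(sorted(candidates))  # only the few d2 terms that also appear elsewhere survive
```

With this toy filter, terms like collies, border, german, and shepherds (each occurring only in d2) are all discarded before expansion.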

My dev notebook is at:
https://colab.research.google.com/drive/1BKLNZUTc9DidgHXM8L4o7vPXNEW9PpOO?usp=sharing

HTH

On a related point, RM3 (unlike Bo1) is intended to examine the scores of the retrieved documents. I don't think the score is appropriately passed to RM3 by PyTerrier:

See https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L223 and https://github.com/terrier-org/pyterrier/blob/master/pyterrier/rewrite.py#L239. Would you be able to verify, and see the impact on Robust04 performance when fixing this?
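For reference, on the input side this would presumably just mean carrying a score column (and a rank, as Terrier transformers usually do) in the feedback frame; the actual fix would be inside rewrite.py. A hedged sketch of the expected input shape only, with a made-up score value:

```python
import pandas as pd

# Explicit relevance feedback with a (hypothetical) retrieval score attached;
# RM3 weights feedback documents by their score, unlike Bo1.
feedback = pd.DataFrame([
    {'qid': '1', 'query': 'dog', 'docno': 'd2', 'score': 1.0, 'rank': 0},
])
print(feedback.columns.tolist())
```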

Dear Craig,

Thanks for your response!
Indeed, I was not aware of the prf.mindf and the prf.maxdp properties, and setting them resolves the problem in my tiny hello world example.

I will have a look into forwarding the retrieval scores to RM3 and will report back the impact on Robust04.

Best regards,

Maik

Hi, I have not yet started with this, but at the beginning of next week I will have time to look into it :)

@mam10eks any news? I'd like to fix this for the next PyTerrier release.

Dear @cmacdonald sorry again for the delay!

I have the upcoming CLEF deadline on my agenda until around Friday, which means I could have first results here on Monday or Tuesday. Would this still suffice?

Best regards,

Maik