Query-Likelihood-Retrieval-Model

In the query likelihood retrieval model, we rank documents by the probability that the query text could be generated by the document language model.
We calculate the probability that we could pull the query words out of the “bucket” of words representing the document.
This is a model of topical relevance, in the sense that the probability of query generation measures how likely it is that a document is about the same topic as the query.
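The "bucket of words" idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name query_log_likelihood and the toy document are hypothetical, documents are treated as lists of already tokenized terms, and probabilities are plain maximum likelihood estimates (term count divided by document length). Log probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms):
    """Unsmoothed log P(Q|D) under a unigram (bag-of-words) document model.

    Hypothetical helper for illustration only.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_prob = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0:
            # Without smoothing, a single missing query word makes the
            # probability of the whole query zero (log probability -inf).
            return float("-inf")
        log_prob += math.log(tf / doc_len)
    return log_prob


# Toy document, assumed to be tokenized already.
doc = "information retrieval ranks documents for a query".split()

print(query_log_likelihood(["retrieval", "query"], doc))
print(query_log_likelihood(["retrieval", "smoothing"], doc))
```

The second call returns negative infinity because "smoothing" never occurs in the toy document, which is the data sparseness problem that smoothing is meant to address.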

Jelinek-Mercer Smoothing

Smoothing refers to the process of adjusting the maximum likelihood estimator to account for inaccuracy due to data sparseness.
Jelinek-Mercer Smoothing is a linear interpolation of the document language model and the collection language model word probabilities, where the coefficient λ determines the weighting balance between the two terms.
The best value of λ depends on the query. Taking λ as the weight on the collection language model, experiments have shown that a small value of λ, around 0.1, works well for short queries and a higher value, around 0.7, for long queries.
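A minimal sketch of scoring with Jelinek-Mercer smoothing follows. It assumes that the parameter lam is the weight on the collection model, so each query term gets probability (1 - lam) * P(term|document) + lam * P(term|collection). The names jm_score, docs, collection_counts and collection_len are hypothetical, the two toy documents are invented for illustration, and skipping terms that never occur in the collection is one possible choice rather than something taken from this repository.

```python
import math
from collections import Counter


def jm_score(query_terms, doc_terms, collection_counts, collection_len, lam=0.1):
    """log P(Q|D) with Jelinek-Mercer smoothing.

    Assumption: lam is the weight on the collection language model.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection; skipping it is an
            # assumption made here so the score stays finite.
            continue
        score += math.log(p)
    return score


# Toy collection, assumed to be tokenized already.
docs = {
    "d1": "query likelihood retrieval model".split(),
    "d2": "smoothing of language model estimates".split(),
}
collection_counts = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_counts.values())

query = ["retrieval", "model"]
ranking = sorted(
    docs,
    key=lambda d: jm_score(query, docs[d], collection_counts, collection_len, lam=0.1),
    reverse=True,
)
print(ranking)
```

Document d2 does not contain "retrieval", yet it still receives a finite score because the collection model contributes a small probability for that term. Document d1 contains both query terms and ranks first.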

The CACM collection dataset has been acquired from http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/
The CACM collection is a set of titles and abstracts from the journal Communications of the ACM (CACM).
The collection consists of the following files:
cacm.all - Text of documents
cite.info - Key to citation info
common_words - Stop words used by SMART
qrels.text - List of relevance judgements
query.text - Original text of the query
CACM HTML documents are obtained from: https://github.com/kaanosm/inb344/tree/845ae8c8c6e5e193e4f8e9c399ddc9f3c82e39f0/week%201/Resources
The collection contains 64 queries and 3204 HTML documents.

Query Likelihood Retrieval Model using Jelinek-Mercer Smoothing technique.
