rjagerman / TipsterSearch

Basic information retrieval system as part of the IR2014 Project at ETH.

TipsterSearch

This information retrieval system is part of the 2014 Information Retrieval course at ETH. It searches and ranks documents in the tipster dataset according to a set of queries. It also computes several metrics over the retrieved results and displays them at the end of the search.

Search models

This project uses two different search models: one term-based and one language-based.

Term-based model

The term-based model uses a logarithmically scaled TF-IDF model and is defined in the file src/main/scala/scoring/TfidfModel.scala. The score is the sum of the logarithmically scaled term frequencies of the individual query words that occur in the document:

      ∑
  w∈query     (1.0 + log2(tf(w)))
∧ w∈document

Where

tf(w) = the term frequency of a word in the document
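
The sum above can be sketched in a few lines of Scala. This is a minimal illustration under stated assumptions, not the code in TfidfModel.scala: the names TermScoreSketch, termFrequencies and score are hypothetical, documents are assumed to be already tokenized, and a query word that appears more than once is counted only once.

```scala
object TermScoreSketch {
  // Base-2 logarithm, matching the log2 in the formula.
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // Hypothetical helper: term frequencies of an already tokenized document.
  def termFrequencies(tokens: Seq[String]): Map[String, Int] =
    tokens.groupBy(identity).map { case (w, ws) => w -> ws.size }

  // Sum of 1.0 + log2(tf(w)) over the distinct query words found in the document.
  // Words that do not occur in the document contribute nothing.
  def score(query: Seq[String], tf: Map[String, Int]): Double =
    query.distinct.flatMap(w => tf.get(w)).map(f => 1.0 + log2(f.toDouble)).sum
}
```

For a document in which "a" occurs four times and "b" once, the query "a b z" would score (1 + log2(4)) + (1 + log2(1)) = 4; "z" is skipped because it does not occur in the document.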

Language-based model

The language-based model uses Jelinek-Mercer smoothing and is defined in the file src/main/scala/scoring/LanguageModel.scala. The score is computed by summing, for every query word that occurs in the document, the log of one plus the ratio of the weighted probability of the word given the document, p(w|d), to the weighted probability of the word over the entire collection, p(w):

      ∑                    (1.0-λ) * p(w|d)
  w∈query     log2( 1.0 + ---------------- )
∧ w∈document                 λ    * p(w)

Where

p(w|d) = tf(w) / |d|
p(w) = cf(w) / |∑ cf(v)|
λ = 0.1
cf(w) = the collection frequency of a word
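
As a rough illustration of the formula and definitions above, the score could be computed as follows. This is a sketch under assumptions rather than the contents of LanguageModel.scala: the object name, the parameter names and the use of plain Maps for tf and cf are all hypothetical, and repeated query words are counted once.

```scala
object LanguageScoreSketch {
  // Base-2 logarithm, matching the log2 in the formula.
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // tf: term frequencies within the document; docLength: |d|
  // cf: collection frequencies; collectionLength: sum of cf(v) over all words v
  // lambda: smoothing weight, 0.1 as stated in the definitions above
  def score(query: Seq[String],
            tf: Map[String, Int], docLength: Int,
            cf: Map[String, Long], collectionLength: Long,
            lambda: Double = 0.1): Double =
    query.distinct.map { w =>
      val f = tf.getOrElse(w, 0)
      val c = cf.getOrElse(w, 0L)
      if (f == 0 || c == 0L) 0.0 // word not in the document: no contribution
      else {
        val pwd = f.toDouble / docLength        // p(w|d) = tf(w) / |d|
        val pw  = c.toDouble / collectionLength // p(w)   = cf(w) / sum of cf(v)
        log2(1.0 + ((1.0 - lambda) * pwd) / (lambda * pw))
      }
    }.sum
}
```

As a worked case, take tf(w) = 2 in a document of length 4 and cf(w) = 9 in a collection of 14 tokens. Then p(w|d) = 0.5 and p(w) = 9/14, the ratio is (0.9 * 0.5) / (0.1 * 9/14) = 7, and the word contributes log2(1 + 7) = 3.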

Instructions

The system is built using the Scala Build Tool (sbt). Please make sure you have sbt installed and can execute sbt from the command line before continuing.

Compilation

To compile, browse to the directory containing build.sbt and run:

sbt compile

Running

To run the software:

sbt run

You can supply the following command line parameters:

-n <value> | --n <value>
    The number of results to return per query (default: 100)
-d <value> | --tipsterDirectory <value>
    The directory where the tipster zips are placed (default: 'dataset/tipster')
-t <value> | --topicsFile <value>
    The topics file (default: 'dataset/topics')
-q <value> | --qrelsFile <value>
    The qrels file (default: 'dataset/qrels')
-m <value> | --model <value>
    The model to use, valid values: [language|tfidf] (default: 'tfidf')

For example, to specify the directory where the tipster zip files can be found use the -d parameter:

sbt "run -d /path/to/tipsterdataset/"

In order to specify which queries to execute, you have to create a topics file which contains them. This file should match the format of the sample topics file that is provided. Similarly, the ground-truth relevance judgments have to be specified in a qrels file, which should match the format of the provided qrels file. You can pass the location of both these files as parameters to the program:

sbt "run -d /path/to/tipsterdataset/ -t /path/to/topics -q /path/to/qrels"

In order to use the language model instead of the (default) tfidf model, you can use the -m parameter:

sbt "run -d /path/to/tipsterdataset/ -t /path/to/topics -q /path/to/qrels -m language"
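
The options and defaults listed above could be modelled with a small case class and a hand-written parser, as sketched below. This is only an illustration of how the flags map to values; the names Config and ConfigSketch are hypothetical, and the project itself may well parse its arguments differently (for example with a library).

```scala
// Hypothetical configuration holder; field defaults mirror the documented defaults.
final case class Config(
  n: Int = 100,
  tipsterDirectory: String = "dataset/tipster",
  topicsFile: String = "dataset/topics",
  qrelsFile: String = "dataset/qrels",
  model: String = "tfidf"
)

object ConfigSketch {
  // Consumes flag/value pairs from left to right; returns Left with a message on bad input.
  @scala.annotation.tailrec
  def parse(args: List[String], acc: Config = Config()): Either[String, Config] = args match {
    case Nil => Right(acc)
    case ("-n" | "--n") :: v :: rest if v.nonEmpty && v.forall(_.isDigit) =>
      parse(rest, acc.copy(n = v.toInt))
    case ("-d" | "--tipsterDirectory") :: v :: rest =>
      parse(rest, acc.copy(tipsterDirectory = v))
    case ("-t" | "--topicsFile") :: v :: rest =>
      parse(rest, acc.copy(topicsFile = v))
    case ("-q" | "--qrelsFile") :: v :: rest =>
      parse(rest, acc.copy(qrelsFile = v))
    case ("-m" | "--model") :: v :: rest if v == "language" || v == "tfidf" =>
      parse(rest, acc.copy(model = v))
    case other :: _ =>
      Left(s"unrecognised or incomplete option: $other")
  }
}
```

Under this sketch, the arguments -d /data -m language would yield a Config with tipsterDirectory set to /data, model set to language, and every other field left at its default.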

License: MIT License


Languages

Perl 55.7%, Scala 44.3%