billstark / ranked-retrieval

a ranked retrieval model that is used for searching

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

This is the README file for A0135817B and A0147995H's submission

Email address:

== Python Version ==

We're using Python Version 2.7.14 for this assignment.

== General Notes about this assignment ==

## Indexing:


  1. For every training file, load the whole content into the system and tokenize into words
  2. Remove invalid characters of the tokenized word
  3. Count the term in a dictionary mapping document id to a term frequency counter
  4. Write postings into the posting file
  5. Write terms into dictionary file with offset to the corresponding postings.


  - We choose not to use sentence tokenizer because we are handle invalid punctuation by keeping only
     digits, alphabets, white spaces and dashes

 - The entire dictionary is read into memory
 - Each query is tokenized using the same method as for indexing
 - For each query term, we calculate their weights using 1 + log(tf)
 - For every document that any of the query terms appear in, we calculate the total score using the query weight
   and the document tf-idf
 - We collect the top 10 documents using a priority queue and return the result

== Files included with this submission ==       - includes the tokenization function which is used for both index and search        - indexing program that will be run to index all the training files.       - searching program that will be used to execute queries in a specific file and give output.
dictionary.txt  - a dictionary mapping terms to their location in the postings file
postings.txt    - includes document IDs for each term

== Statement of individual work ==

Please initial one of the following statements.

[X] We, A0135817B and A0147995H, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.  

[ ] I, A0000000X, did not follow the class rules regarding homework
assignment, because of the following reason:

<Please fill in>

I suggest that I should be graded as follows:

<Please fill in>

== References ==

- tf-idf


a ranked retrieval model that is used for searching


Language:Python 100.0%