pikulet / weighted-tfidf

tf-idf retrieval system based on nltk reuters dataset, information retrieval assignment

== Python Version ==

I'm using Python Version 3.6.6 for this assignment.

== General Notes about this assignment ==

### Indexing Algorithm
The algorithm loops through all the files to index them. 

Every document is read line by line and passed through the nltk
sentence and word tokenisers. Within a document, the frequency of
each term is tabulated and saved in a vector. The vector is then
converted to weighted-tf values using the ln-scheme (logarithmic tf,
no idf).

A sample vector is as follows:
{ 	"egg" 	: ln(2) ,
	"bacon"	: ln(1) }, where ln(tf) = 1 + log10(tf) is the weighted-tf value.

This vector is iterated through to save the corresponding
term-docID-termFrequency information in the posting list, as well as
term-termID-documentFrequency data in the dictionary. The vector is
also used to generate the document length using the cosine scheme.
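
The weighting and length steps above can be sketched as follows. This
is a minimal illustration rather than the code in index.py: the
function names `weighted_tf` and `document_length` are hypothetical,
and base-10 logarithms are assumed.

```python
import math
from collections import Counter

def weighted_tf(tokens):
    """Map each term to its logarithmic tf weight, 1 + log10(tf)."""
    counts = Counter(tokens)
    return {term: 1 + math.log10(tf) for term, tf in counts.items()}

def document_length(vector):
    """Euclidean length of a weighted-tf vector (cosine scheme)."""
    return math.sqrt(sum(weight * weight for weight in vector.values()))

# "egg" occurs twice and "bacon" once, as in the sample vector.
vector = weighted_tf(["egg", "bacon", "egg"])
length = document_length(vector)
```

A term that occurs once always gets weight 1, so a document made up
entirely of distinct terms has length sqrt(number of terms).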

The posting list is represented as a dictionary, since document order
is not as relevant (we are only interested in documents containing the
query terms), and also because there is a need to store the term
frequency with the docID.

When all the documents have been processed, the total document count
and corresponding document frequencies of each term in the dictionary
are known. The idf is calculated and saved back to the dictionary.
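
A minimal sketch of this final pass, assuming the usual
idf = log10(N / df) definition. The function name `compute_idf` and
the plain term-to-df mapping are hypothetical simplifications of the
real dictionary structure.

```python
import math

def compute_idf(document_frequencies, total_documents):
    """Return idf = log10(N / df) for every term in the dictionary."""
    return {
        term: math.log10(total_documents / df)
        for term, df in document_frequencies.items()
    }

# "the" appears in all four documents, "egg" in only one of them.
idf = compute_idf({"egg": 1, "the": 4}, total_documents=4)
```

A term that appears in every document ends up with an idf of zero.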

### Search Algorithm

The search algorithm first converts the query terms to an ln-scheme
vector as shown above. This shortened query, in which repeated terms
have been collapsed into a single vector entry, is then iterated over.

For each query term, the corresponding posting list (represented as
a python dictionary) is retrieved. If the term idf is zero, then
either the term does not exist in the corpus, or it appears in
every document. Such terms are ignored because they do not contribute
to understanding document relevance.

For the rest of the query terms, the weight of that particular query
term is generated by multiplying the weighted-tf with the term-idf.
The documents in the posting list are then iterated through, generating
the per-term-per-docID score by multiplying the query weight and the
document weight together. These scores are added to the corresponding
score tracker for that docID.

Once all the query terms have been processed, we perform a final
normalisation of document scores using the document length, in effect
ranking by cosine similarity (the dot product having been accumulated
in the earlier stage).

A max heap is then used to retrieve the top ten entries to be returned.
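
The search steps above can be sketched as follows. This is a
simplified illustration, not the code in search.py: the function
`search` and the in-memory `postings`, `idf` and `doc_lengths`
mappings are hypothetical stand-ins for the structures loaded from
disk. The query vector is not length-normalised here, since that
would scale every score equally and leave the ranking unchanged.

```python
import heapq
import math
from collections import Counter, defaultdict

def search(query_tokens, postings, idf, doc_lengths, k=10):
    """Rank documents with lnc.ltc weights and return the top k docIDs.

    postings:    term -> {docID: weighted tf of the term in that document}
    idf:         term -> inverse document frequency
    doc_lengths: docID -> Euclidean length of the document vector
    """
    scores = defaultdict(float)
    for term, tf in Counter(query_tokens).items():
        term_idf = idf.get(term, 0.0)
        if term_idf == 0.0:
            continue  # unseen term, or term present in every document
        query_weight = (1 + math.log10(tf)) * term_idf
        for doc_id, doc_weight in postings[term].items():
            scores[doc_id] += query_weight * doc_weight
    # Final cosine normalisation by document length.
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]
    # heapq.nlargest uses a heap internally to pick the k best scores.
    best = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    return [doc_id for doc_id, _ in best]
```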

### Saving to Disk

Compared to the previous hw2, one improvement is the use of multiple
pickle dumps and loads. Each pickle dump marks its own end, so a
single load reads back exactly one dumped object. The posting file is
now a series of posting list dumps, with the starting byte offset of
each dump recorded for retrieval.
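
A minimal sketch of this layout, using an in-memory buffer in place
of the postings file. The helper names `write_postings` and
`read_posting_list` are hypothetical and do not come from index.py or
search.py.

```python
import io
import pickle

def write_postings(stream, postings):
    """Dump each posting list separately and record where it starts."""
    offsets = {}
    for term, posting_list in postings.items():
        offsets[term] = stream.tell()
        pickle.dump(posting_list, stream)
    return offsets

def read_posting_list(stream, offset):
    """Seek to a recorded offset and load exactly one posting list."""
    stream.seek(offset)
    return pickle.load(stream)

# An in-memory buffer stands in for the postings file in this sketch.
buffer = io.BytesIO()
offsets = write_postings(buffer, {"egg": {1: 1.3, 4: 1.0}, "bacon": {2: 1.0}})
bacon_postings = read_posting_list(buffer, offsets["bacon"])
```

Only the offsets need to be kept with the dictionary. A query then
loads just the posting lists for its own terms.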

### Essay Questions

1. In this assignment, we didn't ask you to support phrasal queries,
which is a feature that is typically supported in web search engines.
Describe how you would support phrasal search in conjunction with the
VSM model.

## METHOD A: SUPPORT PHRASAL INDEXING
Currently, there is no record of how many query terms a document
contains. A naive solution is to do context-specific indexing. It is
possible to keep a list of common phrases (in economics, social
policy, names of politicians) and use it in a similar fashion to a
stopword list. These phrases could be recorded at indexing time.

However, it is often difficult to predict the sheer multitude of
possible phrasal searches which could occur, unless the context is
specific (say, phrasal medical terms). It would then make more sense
to explore solutions which rank documents differently at search time.

## METHOD B: MERGE POSTING LISTS, THEN CALCULATE SCORES
One way is to first merge the posting lists for all the query terms.
This approach is equivalent to a boolean query combining the query
terms with an "AND" operation. The score is then only calculated for
the documents in the final posting list. The major flaw of such an
approach is the intolerant retrieval, because only documents that
match every query term are retrieved.
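
A minimal sketch of the merge step, assuming posting lists are
dictionaries keyed by docID. The function name `candidate_documents`
is hypothetical. As described above, this only checks that every
query term occurs in the document, not that the terms are adjacent.

```python
def candidate_documents(query_terms, postings):
    """Return the docIDs that contain every query term (boolean AND)."""
    doc_sets = [set(postings.get(term, {})) for term in query_terms]
    if not doc_sets:
        return set()
    return set.intersection(*doc_sets)
```

Scores would then be accumulated only for the docIDs in the returned
set.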

Should there be insufficient documents (< 10) with the exact phrase
match, then shorter phrasal queries on the biwords (for example,
"annual car" and "car sales" for the query "annual car sales") can be
spawned. This introduces a new challenge: deciding the relative
importance of the biwords, and which biword search results should be
returned. Metrics like the bigram overlap can be used to determine
which documents should be scored.
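
Spawning the biword queries can be sketched as below. The function
name `biwords` is hypothetical.

```python
def biwords(query_terms):
    """Split a phrase into its consecutive two-word sub-phrases."""
    return [" ".join(pair) for pair in zip(query_terms, query_terms[1:])]

sub_queries = biwords(["annual", "car", "sales"])
```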

2. Describe how your search engine reacts to long documents and long
queries as compared to short documents and queries. Is the
normalization you use sufficient to address the problems
(see Section 6.4.4 for a hint)?

The lnc-ltc scheme uses a cosine normalisation of the document vectors
using the Euclidean length of the vectors.

Long documents are likely to contain more distinct terms, and a 
higher frequency of the terms. News articles in the reuters data set
can be classified as non-verbose documents, that is, it is unlikely
for the same content to be repeated across the document. Instead, they
are likely to contain a lot of unique terms and ideas to convey the
news to readers.

When performing the cosine normalisation, the effect of the long
document length is damped. Relative to short documents, the Euclidean
length of long documents is not very different. As a result, long
documents have a higher score. Consider the simple example below,
where lengths are computed from raw term counts:

doc1: egg egg egg
doc2: egg egg egg cat cat cat cat	(7 words)
doc3: i had an egg for breakfast today 	(7 words)
length (doc1) = 3
length (doc2) = sqrt(square(3) + square(4)) = 5
length (doc3) = sqrt(7) = 2.65
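
The three lengths can be checked with a short script. This sketch
uses raw term counts, exactly as in the example, and the helper name
`raw_length` is hypothetical.

```python
import math
from collections import Counter

docs = {
    "doc1": "egg egg egg",
    "doc2": "egg egg egg cat cat cat cat",
    "doc3": "i had an egg for breakfast today",
}

def raw_length(text):
    """Euclidean length of the raw term-count vector of a document."""
    counts = Counter(text.split())
    return math.sqrt(sum(tf * tf for tf in counts.values()))

lengths = {name: round(raw_length(text), 2) for name, text in docs.items()}
```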

We observe that unique terms significantly reduce the document length,
skewing the scores of non-verbose documents higher. This observation
is pertinent in understanding our reuters dataset, because the news
articles can be fairly inconsistent. Document 1 is a fairly long
article with a lot of unique terms, while document 7785 is a short
record that repeats itself.

The positive skewing on long, non-verbose documents and the existence
of an inconsistency in the quality of data suggests that the lnc-ltc
scheme may not be sufficient. We can compensate for the effect of
document length by changing our normalisation method from Euclidean
normalisation to Pivoted Document Length Normalisation.
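
A minimal sketch of the pivoted normaliser, assuming the usual linear
form (1 - slope) * pivot + slope * length. The function name
`pivoted_length` is hypothetical. The slope of 0.75 is only a
placeholder, and the pivot (often taken as the average document
length of the collection) would need tuning on the corpus.

```python
def pivoted_length(length, pivot, slope=0.75):
    """Pull a document's Euclidean length towards the pivot value."""
    return (1.0 - slope) * pivot + slope * length

pivot = 4.0
short_doc = pivoted_length(2.0, pivot)  # raised towards the pivot
long_doc = pivoted_length(8.0, pivot)   # lowered towards the pivot
```

Scores are then divided by the pivoted value instead of the raw
Euclidean length. A document whose length equals the pivot is
unaffected.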

In your judgement, is the ltc.lnc scheme (n.b., not the ranking scheme
you were asked to implement) sufficient for retrieving documents from
the Reuters-21578 collection?

There are two primary differences between using ltc-lnc and lnc-ltc.
A. Repeated calculation of idf
The idf is multiplied into every document tf weight instead of every
query tf weight. This has essentially no effect on the scalar
multiplication in cosine similarity.

score(t, d) = l-scheme(tf(t, d)) * l-scheme(tf(t, q)) * t-scheme(t)

However, a lot of extra work is done at the indexing stage to first
calculate l-scheme(tf(t, d)) * t-scheme(t).

It would be computationally more efficient to calculate
w(t, q) = l-scheme(tf(t, q)) * t-scheme(t) for the query term, then
multiply this value by w(t, d) for the respective document weights.

B. Document length
The idf of each term is now factored into the document length. For
example, consider this corpus of two documents:

doc1: the the the cat
doc2: the the the the cat

idf(the) = log(2/2) = 0
idf(cat) = log(2/2) = 0	(both documents contain "cat")

tf(doc1, the) = 1 + log(3) = 1.5
tf(doc1, cat) = 1 + log(1) = 1
tf(doc2, the) = 1 + log(4) = 1.6
tf(doc2, cat) = 1 + log(1) = 1

(lnc-ltc): Length of doc1 = sqrt(sq(1.5) + sq(1)) = 1.8
(lnc-ltc): Length of doc2 = sqrt(sq(1.6) + sq(1)) = 1.9
(ltc-lnc): Length of doc1 = 0
(ltc-lnc): Length of doc2 = 0

Documents which contain only common terms cannot be differentiated
from each other, as their lengths all evaluate to 0. This
situation is not very problematic for the reuters data set because
news articles are likely to contain unique terms and references.
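
The comparison can be reproduced with the sketch below, which
computes idf directly from the two documents as written and assumes
base-10 logarithms. The helper names `idf_table` and `length` are
hypothetical.

```python
import math
from collections import Counter

docs = {
    "doc1": "the the the cat".split(),
    "doc2": "the the the the cat".split(),
}

def idf_table(docs):
    """idf = log10(N / df), computed over the toy corpus."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {term: math.log10(n / count) for term, count in df.items()}

def length(tokens, idf=None):
    """Euclidean length of the log-tf vector, optionally scaled by idf."""
    total = 0.0
    for term, tf in Counter(tokens).items():
        weight = 1 + math.log10(tf)
        if idf is not None:
            weight *= idf[term]
        total += weight * weight
    return math.sqrt(total)

idf = idf_table(docs)
# Document side of lnc-ltc: no idf in the document vector.
lnc_lengths = {name: round(length(tokens), 1) for name, tokens in docs.items()}
# Document side of ltc-lnc: idf is multiplied into the document vector.
ltc_lengths = {name: length(tokens, idf) for name, tokens in docs.items()}
```

With a zero length, the cosine normalisation would divide by zero, so
an implementation would have to special-case such documents.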

3. Do you think zone or field parametric indices would be useful for
practical search in the Reuters collection? Note: the Reuters
collection does have metadata for each article but the quality of the
metadata is not uniform, nor are the metadata classifications
uniformly applied (some documents have it, some don't).

It is unlikely that zone and parametric indices are useful in the
Reuters collection. More often than not, users want to look up news
articles based on the content, rather than the author.

However, there is a benefit to indexing by year (vis-a-vis date),
since users may want to look up news articles from a period in time.
It is also unlikely that users will remember and look up the exact
date of the articles, else they would just go to an archive and not a
search engine.

Nevertheless, because the metadata quality and availability are not
uniform, introducing parametric indices would create a bias towards
articles that contain the metadata. The results from such a search may
be skewed and not ideal, but that is still better than having no such
information at all.

== Files included with this submission ==

index.py -- program to index the corpus
search.py -- program to process queries
dictionary.txt -- dictionary of terms mapped to frequencies and offsets
postings.txt -- length information, postings list
stopwords -- stopwords to be removed

== Statement of individual work ==

[X] I, JOYCE YEO SHUHUI, certify that I have followed the CS 3245 Information
Retrieval class guidelines for homework assignments.  In particular, I
expressly vow that I have followed the Facebook rule in discussing
with others in doing the assignment and did not take notes (digital or
printed) from the discussions.

== References ==
# nlargest for heap
	https://docs.python.org/3/library/heapq.html
# Verifying query results
	CHOW KENG JI + SOH JASON
	We discovered a mistake in my use of Porter Stemmer.
	Case-folding should be applied before stemming.
