Information-retrieval

This is my final project for the information retrieval course in college. It compares the Standard Boolean Model, the Extended Boolean Model, and the Vector Space Model on the task of searching Chinese keywords, and lists the advantages and drawbacks of each.

Development tools and techniques

Visual FoxPro

Standard Boolean Model

If we use this query (roughly "badminton AND (everyone OR ** OR competition) AND promotion") to search documents,

羽球AND ( 全民OR**OR比賽 ) AND推廣

The Standard Boolean Model retrieves three related documents.

Advantage
- The result is predictable and easy to interpret
- A query can combine many keywords with logical operators
- Bigrams can be used directly, with no calculation in advance
Drawback
- The effectiveness depends heavily on how the user designs the Boolean query
- A simple query retrieves too many unrelated documents
- A complex query is too difficult for the user to design
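
The Boolean matching described above can be sketched in Python. This is only an illustration, not the project's Visual FoxPro code: the function names, the nested-tuple query format, and the sample documents are all hypothetical, and every keyword is assumed to be exactly two characters long so that it can be looked up directly as a bigram. The term masked as ** in the query is left out.

```python
def bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def evaluate(query, grams):
    """Evaluate a nested Boolean query against one document's bigram set.

    A query is either a two-character keyword (str) or a tuple of the
    form (operator, operand, ...), where operator is "AND", "OR" or "NOT".
    """
    if isinstance(query, str):
        return query in grams
    operator, *operands = query
    values = [evaluate(operand, grams) for operand in operands]
    if operator == "AND":
        return all(values)
    if operator == "OR":
        return any(values)
    if operator == "NOT":
        return not values[0]
    raise ValueError(f"unknown operator: {operator}")


def search(query, docs):
    """Return the names of the documents that satisfy the query."""
    return [name for name, text in docs.items()
            if evaluate(query, bigrams(text))]


# Hypothetical sample documents, not taken from the project's corpus.
DOCS = {
    "d1": "全民羽球運動推廣計畫",
    "d2": "羽球比賽今天開始",
    "d3": "推廣籃球比賽",
}

# The query from the text, without the term that is masked as ** there.
QUERY = ("AND", "羽球", ("OR", "全民", "比賽"), "推廣")

print(search(QUERY, DOCS))
```

In this toy corpus only d1 contains all three required parts; d2 lacks 推廣 and d3 lacks 羽球. A document either matches or it does not, so there is no ranking among the results.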

Extended Boolean Model

If we use the same query to search documents,

羽球AND ( 全民OR**OR比賽 ) AND推廣

We still get the same three documents, but each document has only one hit.

Advantage
- It may find documents that correspond better to the query because it uses the sentence as the matching unit
Drawback
- It is so strict that it may not find any documents
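
A minimal Python sketch of sentence-level matching follows. It rests on an assumption: the text says only that the sentence is the unit, and the sketch reads that as "the whole Boolean query must be satisfied inside a single sentence, and each such sentence counts as one hit". The helper names and both sample documents are hypothetical, and keywords are again assumed to be two characters long.

```python
import re


def bigrams(text):
    """Return the set of overlapping two-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def matches(query, grams):
    """Evaluate a nested ("AND"/"OR", operand, ...) query on a bigram set."""
    if isinstance(query, str):
        return query in grams
    operator, *operands = query
    values = [matches(operand, grams) for operand in operands]
    return all(values) if operator == "AND" else any(values)


def sentences(text):
    """Split text on Chinese full stops, exclamation and question marks."""
    return [s for s in re.split(r"[。！？]", text) if s]


def sentence_hits(query, text):
    """Count the sentences that satisfy the whole query on their own."""
    return sum(matches(query, bigrams(s)) for s in sentences(text))


# The query from the text, without the term that is masked as ** there.
QUERY = ("AND", "羽球", ("OR", "全民", "比賽"), "推廣")

# Hypothetical documents. DOC_B holds every keyword, but never together
# in one sentence.
DOC_A = "全民羽球推廣活動開跑。羽球比賽下週舉行。協會持續推廣運動。"
DOC_B = "羽球比賽下週舉行。協會持續推廣運動。"

print(sentence_hits(QUERY, DOC_A), sentence_hits(QUERY, DOC_B))
```

DOC_B would match at document level, because its keywords are spread over two sentences, yet it has no sentence-level hit. That is the strictness listed as the drawback.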

Vector Space Model

We first calculate the term frequency (tf) and inverse document frequency (idf) of each bigram. Next, we calculate the tf-idf weights. Finally, all documents and the query are transformed into weighted term vectors, and cosine similarity measures how similar they are.

- Term frequency (tf): terms that occur more frequently in a document are more important. The term frequency is normalized within the document.
- Inverse document frequency (idf): terms that appear in many different documents are less indicative of the overall topic. The logarithm is used to dampen the effect, and *N* is the total number of documents.
- tf-idf weighting: a term that occurs frequently in a document but rarely in the rest of the documents is given a high score.
- Cosine similarity measure: the angle between two vectors measures their similarity.
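
One common textbook formulation of these four quantities is given below; it may differ in detail from the one used in this project. The symbols other than *N* are assumptions introduced here: f_{i,j} is the raw count of bigram i in document j, n_i is the number of documents that contain bigram i, and q is the query vector. Dividing by the largest count in the document is only one of several usual ways to normalize tf.

```latex
tf_{i,j} = \frac{f_{i,j}}{\max_{k} f_{k,j}}
\qquad
idf_{i} = \log \frac{N}{n_{i}}
\qquad
w_{i,j} = tf_{i,j} \cdot idf_{i}

\cos(d_{j}, q) =
  \frac{\sum_{i} w_{i,j} \, w_{i,q}}
       {\sqrt{\sum_{i} w_{i,j}^{2}} \; \sqrt{\sum_{i} w_{i,q}^{2}}}
```

Because all weights are non-negative, the cosine lies between 0 and 1, and a larger value means a smaller angle between the document and the query.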

We also calculate the tf and idf of each query term.
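
The weighting and ranking steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's implementation: tf is normalized by the largest bigram count in the text, idf uses the natural logarithm of N divided by the document frequency, query bigrams that occur in no document are ignored, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the list of overlapping two-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def tf(text):
    """Bigram counts divided by the largest count in the same text."""
    counts = Counter(bigrams(text))
    if not counts:
        return {}
    top = max(counts.values())
    return {gram: count / top for gram, count in counts.items()}


def idf(texts):
    """log(N / n_i), where n_i is the number of texts containing bigram i."""
    texts = list(texts)
    df = Counter(gram for text in texts for gram in set(bigrams(text)))
    return {gram: math.log(len(texts) / n) for gram, n in df.items()}


def weights(text, idf_table):
    """tf-idf vector of a text; bigrams unknown to the corpus are skipped."""
    return {gram: value * idf_table[gram]
            for gram, value in tf(text).items() if gram in idf_table}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    table = idf(docs.values())
    q = weights(query, table)
    scored = [(name, cosine(q, weights(text, table)))
              for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents, not taken from the project's corpus.
DOCS = {
    "d1": "全民羽球運動推廣計畫",
    "d2": "羽球比賽今天開始",
    "d3": "籃球比賽今天開始",
}

ranking = rank("羽球推廣", DOCS)
for name, score in ranking:
    print(name, round(score, 3))
```

The query is plain text with no operators. In this toy corpus d1 ranks first because it shares both 羽球 and the rarer 推廣 with the query, d2 shares only 羽球, and d3 shares nothing and scores zero.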

Advantage
- It is based on mathematics
- It considers both local (within-document) and global (collection-wide) information
- Results are ranked
- It is fast
Drawback
- It lacks semantics

Conclusion

The Standard Boolean Model finds documents that exactly match the query, but the query needs to be well designed.
The Extended Boolean Model is more precise than the Standard Boolean Model because it is based on sentences.
The Vector Space Model is more intuitive to use because it finds documents without requiring the user to design a Boolean query.
In this project, the top-ranked documents in the Vector Space Model were consistently the documents we wanted.

Reference

Information retrieval course of information management at Soochow University

ChienKangLu / Information-Retrieval