SearchEngine
This is the repository for my Information Retrieval course project from ASU Spring 2015 course. It was implemented progressively.
update 0: TF-IDF based retrieval
update 1: added options for Authorites and Hubs
update 2: added options for page rank
To run the code. First you need to setup jython and install pylucene.
Then run the code as jython SearchEngine.py <lexicon_filename>
The code CreateLexicon.py
running as jython CreateLexicon.py <lexicon_filname>
will just create an index and save it down in a .pkl.gz
file. This also can save document norms pre-saved.
The code SearchFiles.py
is the re-writing of the code SearchFiles.java
into python.
The code LinkAnalysis.py
is the re-writing of the code LinkAnalysis.java
into python. This can be used to understand how pylucene
works.
These can be used to understand how pylucene
works.
Setting up SearchFiles.py
verbose = False
will run the code silently.create_lexicon_flag = True
. ifTrue
will rebuild lexicon from scratch, ifFalse
will load a pre-created one as supplied insys_arg[1]
normalize = False
. ifTrue
will use document norms and normalized tf-idf,False
will not.n_retrieves = 10
number of documents to retreivetf_idf_flag = True
True
retrieves based on Tf/idf, False retrieves based on only Tf.ah_flag = True
will run the code using Authorites and Hubs weighting.pr_flag = True
will run the code using Page Rank weighting.directory = '../index'
will setup the location of index.- having both the above
False
will simply run just TF-IDF. linksFile
andcitationsFile
are text files containing links and citations needed for HITS and pagerank algorithms.root_set_size
is a variable that determines the root set for hits algorithm.maxIter
determines how many iterations for power iterating to get eigen values.