benedictflorance / holi-search-engine

Holi Search Engine

Run arguments for testing on local machine:
KVS Master: 8000
KVS Worker: 8001 localhost:8000
Flame Master: 9000
Flame Worker: 9001 localhost:9000

Run arguments for EC2:
KVS Master: sudo java -cp bin cis5550.kvs.Master 8000
KVS Worker: sudo java -cp bin cis5550.kvs.Worker 8001 worker1
Flame Master: sudo java -cp bin cis5550.flame.Master 9000
Flame Worker: sudo java -cp bin cis5550.flame.Worker 9001


Flame submit (run this command on a local machine): java -cp bin cis5550.flame.FlameSubmit crawler.jar

Restart crawler:

To restart the crawler after pausing, look in the KVS worker directory for a long-named table that contains the columns key and url. That is the queue to continue with. You may safely delete the other long-named table, which should be empty. Start the crawler with the usual settings, but add "-t table-name" to the arguments passed to the crawler (do not put ".table" after the name). The crawler should pick up the queue. The first few crawls may show "Already attempted"; this is normal, because the crawler starts from the beginning of the queue, which may contain URLs it has already tried.

Indexer, PageRank, TermFrequency and Idf:

  • To compile use javac cis5550/jobs/ && jar -cf xxx.jar cis5550/jobs/xxx.class
  • To run use java cis5550.flame.FlameSubmit localhost:9000 xxx.jar



  • Results are sorted in descending order of 3 * w_td * w_tq + 0.75 * page_rank, where
    • w_td is the Euclidean-normalized tf weight, without idf
    • w_tq is the product of the query term frequency and the idf raised to the power 1.5
    • page_rank is the PageRank value produced by the iterative PageRank algorithm.
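The scoring rule above can be sketched in a few lines of Java. This is only an illustration, not the ranker's actual code: the class and method names (RankScoreSketch, Candidate, documentWeight, queryWeight) are hypothetical. It also assumes that w_td is the raw term frequency divided by the Euclidean norm of the document's tf vector, and that only the idf (not the whole product) is raised to 1.5.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ranking formula; names are not from the repo.
class RankScoreSketch {
    /** Hypothetical holder for the per-document inputs to the score. */
    static final class Candidate {
        final String url;
        final double wTd;
        final double wTq;
        final double pageRank;

        Candidate(String url, double wTd, double wTq, double pageRank) {
            this.url = url;
            this.wTd = wTd;
            this.wTq = wTq;
            this.pageRank = pageRank;
        }
    }

    /** Assumed document weight: raw tf over the Euclidean norm of all tf values in the document. */
    static double documentWeight(int tf, int[] allTfInDocument) {
        double sumSquares = 0.0;
        for (int f : allTfInDocument) {
            sumSquares += (double) f * f;
        }
        return sumSquares == 0.0 ? 0.0 : tf / Math.sqrt(sumSquares);
    }

    /** Assumed query weight: query term frequency times idf^1.5. */
    static double queryWeight(int queryTermFrequency, double idf) {
        return queryTermFrequency * Math.pow(idf, 1.5);
    }

    /** Combined score: 3 * w_td * w_tq + 0.75 * page_rank. */
    static double score(double wTd, double wTq, double pageRank) {
        return 3.0 * wTd * wTq + 0.75 * pageRank;
    }

    static double score(Candidate c) {
        return score(c.wTd, c.wTq, c.pageRank);
    }

    /** Returns a copy of the candidates sorted by descending score. */
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        return sorted;
    }
}
```

With these assumptions, a term that is frequent in the document and rare in the corpus dominates the first summand, while page_rank acts as a query-independent boost.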


Build Instructions

  • To run the ranker, use java -cp lib/\*:src cis5550.ranker.Ranker port kvs_ip:kvs_port

Response Format

JSON-stringified Java objects, where each object has three fields:

  • url - url of the webpage,
  • title - title of the webpage (max of 60 characters),
  • page_head - snippet of the webpage (max of 300 characters)
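As a rough illustration of the three fields and their length limits, the sketch below truncates the title and snippet and emits one result as JSON. The hand-rolled serializer and the names SnippetSketch, truncate and toJson are assumptions made for this example; the ranker itself may well rely on a JSON library from lib/. The limits are counted in Java chars (UTF-16 code units), which is also an assumption.

```java
// Hypothetical sketch of one response object; not the ranker's actual code.
class SnippetSketch {
    static final int MAX_TITLE = 60;      // title limit stated in the README
    static final int MAX_PAGE_HEAD = 300; // snippet limit stated in the README

    /** Cuts text to at most maxChars chars; null becomes an empty string. */
    static String truncate(String text, int maxChars) {
        if (text == null) {
            return "";
        }
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    /** Escapes the characters that may not appear raw inside a JSON string. */
    static String escapeJson(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Builds one JSON object with the fields url, title and page_head. */
    static String toJson(String url, String title, String pageHead) {
        return "{\"url\":\"" + escapeJson(url) + "\","
             + "\"title\":\"" + escapeJson(truncate(title, MAX_TITLE)) + "\","
             + "\"page_head\":\"" + escapeJson(truncate(pageHead, MAX_PAGE_HEAD)) + "\"}";
    }
}
```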

Usage Instructions

  • Exposes a GET endpoint at path /search with
    • a required query parameter q that has the encoded query phrase string
    • an optional query parameter page that defaults to 1 (first page).
  • Each page returns 10 urls. Pages beyond the number of matching urls will return empty strings.
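A client-side view of this interface might look like the sketch below. It builds the request URL with an encoded q and a page number, and shows one way a ranked list could be cut into pages of 10. The class name SearchRequestSketch, both method names, and the host and port in the usage note are hypothetical. Returning an empty list for an out-of-range page is a stand-in for the empty response described above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical sketch of calling /search and of page slicing; not the repo's code.
class SearchRequestSketch {
    static final int RESULTS_PER_PAGE = 10; // page size stated in the README

    /** Builds a /search URL with a form-encoded query phrase and a 1-based page number. */
    static String buildSearchUrl(String rankerHostPort, String query, int page) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "http://" + rankerHostPort + "/search?q=" + encoded + "&page=" + page;
    }

    /** Returns the slice of ranked results for a 1-based page, or an empty list when out of range. */
    static <T> List<T> pageOf(List<T> ranked, int page) {
        long from = (long) (page - 1) * RESULTS_PER_PAGE; // long avoids int overflow for huge page numbers
        if (page < 1 || from >= ranked.size()) {
            return List.of();
        }
        int to = (int) Math.min(from + RESULTS_PER_PAGE, ranked.size());
        return ranked.subList((int) from, to);
    }
}
```

For example, buildSearchUrl("localhost:8080", "hello world", 2) yields http://localhost:8080/search?q=hello+world&page=2, and with 25 ranked results pageOf returns 5 items for page 3 and nothing for page 4.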



javac -cp "lib/*" --source-path src src/cis5550/webserver/


java -cp "../lib/*;" cis5550.webserver.TestServer <frontend-server port> <ranker ip:port> <kvs ip:port>


Open a browser tab at the frontend-server port.

Code Explanation

  • The TestServer class defines these routes:
    • /: this route simply returns the home page
    • /search?q=&p=: this route expects a JSON object like the one below:
import java.util.List;

// One search hit: page title, URL, and snippet.
class SearchResult {
    String title;
    String url;
    String page_head;

    public SearchResult(String title, String url, String page_head) {
        this.title = title;
        this.url = url;
        this.page_head = page_head;
    }
}

// One page of results plus paging metadata.
class SearchResultsResponse {
    List<SearchResult> results;
    int page;
    int totalPages;

    public SearchResultsResponse(List<SearchResult> results, int page, int totalPages) {
        this.results = results;
        this.page = page;
        this.totalPages = totalPages;
    }
}


Languages: Java 92.9%, JavaScript 6.9%, Shell 0.2%