IR-project

Information Retrieval final project at USI, Lugano

Step 1: Crawling (skip if you want to use the included data)

First, install Scrapy (requires Python; good luck if you're on Windows):

$ pip install scrapy

Then move into the crawler directory and run the desired spider(s):

$ cd crawler
crawler$ scrapy crawl -o ../data/imdb_result.json imdb
crawler$ scrapy crawl -o ../data/rottentomatoes_result.json rottentomatoes
crawler$ scrapy crawl -o ../data/allmovie_result.json allmovie

This will save all the data in JSON format in the file specified with the -o parameter.
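
For reference, each spider in the crawler directory is a plain Scrapy spider. The sketch below is only illustrative: the spider name, start URL and CSS selectors are made up, while the real ones live in the imdb, rottentomatoes and allmovie spiders used above.

import scrapy

class MovieSpider(scrapy.Spider):
    # Hypothetical example spider; the real spiders define their own
    # names, start URLs and selectors.
    name = "example"
    start_urls = ["https://www.example.com/movies"]

    def parse(self, response):
        # Yield one item per movie entry on the page (selectors are placeholders).
        for movie in response.css("div.movie"):
            yield {
                "title": movie.css("h2::text").get(),
                "year": movie.css("span.year::text").get(),
            }
        # Follow pagination, if the site has it.
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)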

Step 2: Indexing and browsing (the easy/prod way)

Use the included Docker image to create your collection and spin up the UI webserver:

$ docker-compose up -d

Then feed it the data that you crawled (must be in the data directory and in a supported format):

$ docker exec ir-project_solr post -c movies data/*

That's it. Head over to localhost:3000 and start browsing!

Step 2x: Manually indexing and browsing (the hard/dev way)

Start the Solr server (requires Java):

$ solr-8.7.0/bin/solr start

Then create the collection and index the crawled data:

$ solr-8.7.0/bin/solr create -c movies -d movies
$ solr-8.7.0/bin/post -c movies ../data/*

Then start the webserver:

$ yarn start

and go to localhost:3000 to use the search UI. It's that simple!
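
If you prefer to query Solr directly instead of going through the UI, the movies collection is reachable on Solr's default port (8983) via the standard select endpoint. A minimal sketch, assuming the requests library is installed; the actual field names depend on the movies configset in the repo:

import requests

# Fetch the first five documents from the movies collection.
# q=*:* matches every document; swap in a field query such as
# title:matrix once you know the schema (field names are an assumption).
resp = requests.get(
    "http://localhost:8983/solr/movies/select",
    params={"q": "*:*", "rows": 5},
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    print(doc)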

Cleanup

If you used Docker, simply shut down the containers with

$ docker-compose down

Otherwise kill the UI webserver, then delete the Solr collection:

$ solr-8.7.0/bin/solr delete -c movies

Stop Solr:

$ solr-8.7.0/bin/solr stop -all


License

GNU General Public License v3.0

