Information Retrieval final project at USI, Lugano
First, install Scrapy (requires Python; good luck if you're on Windows):
$ pip install scrapy
Then move into the crawler directory and run the desired spider(s):
$ cd crawler
crawler$ scrapy crawl -o ../data/imdb_result.json imdb
crawler$ scrapy crawl -o ../data/rottentomatoes_result.json rottentomatoes
crawler$ scrapy crawl -o ../data/allmovie_result.json allmovie
This will save all the data in JSON format in the file specified with the -o
parameter.
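
The spiders themselves live under the crawler directory. As a rough idea of their shape (the start URL, selectors, and fields below are illustrative assumptions, not the repo's actual code), a minimal Scrapy spider looks like this:

import scrapy

class ImdbSpider(scrapy.Spider):
    # "imdb" is the name you pass to `scrapy crawl`; the real spider
    # in crawler/ defines its own start URLs and parsing logic.
    name = "imdb"
    start_urls = ["https://www.imdb.com/chart/top/"]  # assumed entry point

    def parse(self, response):
        # Hypothetical selectors: yield one item (a plain dict) per movie.
        for row in response.css("td.titleColumn"):
            yield {
                "title": row.css("a::text").get(),
                "year": row.css("span.secondaryInfo::text").get(),
            }

Each yielded dict becomes one JSON object in the -o output file.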
Use the included Docker Compose setup to create your collection and spin up the UI webserver:
$ docker-compose up -d
Then feed it the data that you crawled (it must be in the data directory and in a supported format):
$ docker exec ir-project_solr post -c movies data/*
That's it. Head over to localhost:3000 and start browsing!
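
To double-check that the documents actually made it into the index, you can query Solr's select handler directly; a small sketch, assuming Solr listens on its default port 8983 and the collection is named movies:

import requests

# Ask the movies collection for its total document count (rows=0 returns
# no documents, just the count in numFound).
resp = requests.get(
    "http://localhost:8983/solr/movies/select",
    params={"q": "*:*", "rows": 0},
)
resp.raise_for_status()
print("indexed documents:", resp.json()["response"]["numFound"])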
If you'd rather skip Docker, start the Solr server manually (requires Java):
$ solr-8.7.0/bin/solr start
Then create the movies collection and index the crawled data:
$ solr-8.7.0/bin/solr create -c movies -d movies
$ solr-8.7.0/bin/post -c movies ../data/*
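
If bin/post is not available on your machine, the same files can be sent to Solr's JSON update handler over HTTP instead; a sketch, assuming the crawled files are JSON arrays of flat documents (which is what scrapy's -o produces for .json output):

import glob
import requests

# Post each crawled JSON file to the update handler, committing as we go.
for path in glob.glob("../data/*.json"):
    with open(path, "rb") as f:
        resp = requests.post(
            "http://localhost:8983/solr/movies/update?commit=true",
            headers={"Content-Type": "application/json"},
            data=f,
        )
    resp.raise_for_status()
    print("posted", path)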
Then start the UI webserver:
$ yarn start
and go to localhost:3000 to use the search UI. It's that simple!
Once you're done, if you used Docker, simply shut down the containers with
$ docker-compose down
Otherwise, kill the UI webserver, then delete the Solr collection:
$ solr-8.7.0/bin/solr delete -c movies
Finally, stop Solr:
$ solr-8.7.0/bin/solr stop -all