This project is a web page crawler service. It provides a REST interface that accepts a JSON-formatted list of URLs, which the server qualifies and persists in the repository. The endpoint is published at /crawler; it is the only accessible path, and it expects input as follows:
[
  { "url": "centrallecheraasturiana.es", "rank": 834987 },
  { "url": "guiafull.com", "rank": 571272 }
]
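A request like the one above can be built and sent with a short script. Here is a minimal sketch in Python; the host, port, and context path /webcrawler are assumptions, so adjust them to your actual deployment:

```python
import json
import urllib.request

# The payload is a JSON array of {url, rank} objects, as the endpoint expects.
payload = [
    {"url": "centrallecheraasturiana.es", "rank": 834987},
    {"url": "guiafull.com", "rank": 571272},
]
body = json.dumps(payload).encode("utf-8")

# Assumed deployment URL -- the context path depends on how the war is
# deployed, so adjust accordingly.
req = urllib.request.Request(
    "http://localhost:8080/webcrawler/crawler",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(req) would return immediately with status 200,
# since the endpoint processes the URLs asynchronously.
```

Because the endpoint only fires off the background processing, the response carries no result data; the outcome has to be checked in MongoDB afterwards.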
The endpoint is asynchronous: it returns HTTP status 200 immediately and fires a background thread to process the URLs.
Once processed, the URLs are stored in MongoDB in a database named marfeel, inside a collection named urls, which should exist after the first request. Each stored document contains the following fields:

uri: the URL that was processed
rank: the URL's rank
marfeelizable: whether the URL is qualified or not
error: whether an error occurred while processing the URL
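As a sketch, a stored document could be represented like this (the field names come from the list above; the concrete values are illustrative assumptions, not taken from a real run):

```python
# Illustrative shape of one document in the marfeel.urls collection.
# Field names follow the README; the values are made up for the example.
crawled_doc = {
    "uri": "centrallecheraasturiana.es",  # the URL that was processed
    "rank": 834987,                       # the rank submitted with the URL
    "marfeelizable": True,                # whether the URL was qualified
    "error": False,                       # whether processing hit an error
}

# With a MongoDB driver such as pymongo (not part of this project), the
# stored documents could be inspected with something like:
#   client["marfeel"]["urls"].find_one({"uri": crawled_doc["uri"]})
```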
MongoDB settings can be found in src/main/webapp/WEB-INF/dispatch-servler.xml.
To build the project, run

mvn install

inside the unzipped project's folder. If successful, this produces a war file inside the target folder, which can be deployed to the server.
Integration tests have been disabled. To run them, you must have a running MongoDB instance and configure src/main/webapp/WEB-INF/dispatch-servler.xml to match it. To enable the tests, delete the @Ignore annotation found in src/main/test/com/itomas/webcrawler/CrawlerControllerTest.java at line 26.