iagotomas / webcrawler

A basic webcrawler


This project aims to be a web page scraper/crawler service. It provides a REST interface that accepts a JSON-formatted list of URLs, which the server qualifies and persists in the repository. The endpoint is published at /crawler, the only accessible path, and it expects input as follows:

[
  { "url": "centrallecheraasturiana.es", "rank": 834987 },
  { "url": "guiafull.com", "rank": 571272 }
]

The endpoint is asynchronous: it returns HTTP status 200 immediately and fires a background thread to process the URLs.
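As a sketch, the endpoint could be exercised from Java 11+ using java.net.http.HttpClient. The host, port, and context path below are assumptions, since the README does not state where the WAR is deployed:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CrawlerClient {
    public static void main(String[] args) throws Exception {
        // Payload matching the expected input format from the README.
        String body = """
                [ { "url": "centrallecheraasturiana.es", "rank": 834987 },
                  { "url": "guiafull.com", "rank": 571272 } ]""";

        // Assumed deployment location; adjust host/port/context to your server.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/webcrawler/crawler"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The endpoint replies 200 immediately; processing happens asynchronously.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```

Because the server only acknowledges receipt, a 200 response does not mean the URLs have been qualified yet; check the urls collection afterwards.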

Once processed, the URLs are stored in MongoDB in a database named marfeel; after the first request, a collection named urls should be present. Each stored document contains the following fields:
uri: the URL processed
rank: the URL's rank
marfeelizable: whether the URL is qualified or not
error: whether an error occurred while processing the URL
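A minimal sketch of the stored document as a Java type. The class name CrawledUrl and the field types are assumptions; the README only lists the field names:

```java
// Hypothetical model of one document in the "urls" collection.
// Field names mirror the README; the types are assumed.
public record CrawledUrl(
        String uri,            // the URL processed
        long rank,             // the URL's rank
        boolean marfeelizable, // whether the URL was qualified
        boolean error) {       // whether processing failed

    public static void main(String[] args) {
        CrawledUrl doc = new CrawledUrl("guiafull.com", 571272L, true, false);
        System.out.println(doc);
    }
}
```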

MongoDB settings can be found at src/main/webapp/WEB-INF/dispatch-servler.xml.

Build

To build the project, run mvn install inside the unzipped project folder. If successful, this produces a WAR file inside the target folder, which can be deployed to the server.

Note

Integration tests are disabled. To run them, you must have a running MongoDB instance and configure src/main/webapp/WEB-INF/dispatch-servler.xml to match it. To enable the tests, delete the @Ignore annotation found in src/main/test/com/itomas/webcrawler/CrawlerControllerTest.java at line 26.
