amking / spider

Scala Website Spider

Overview

This project implements a basic web spider in Scala, demonstrating in particular the use of Actors. The application's primary purpose is to traverse a website looking for broken internal or external links, which it reports at the end of processing. However, it has been designed so it can be extended and repurposed for related applications as required.

Getting Started

Eclipse Users

  • Ensure you have the Eclipse EGit and Scala IDE plugins installed
  • Copy the URL in the Git Read-Only entry field at the top of this web page to the clipboard
  • In Eclipse, execute Window → Show View → Other... → Git → Git Repositories to make this view visible
  • Activate the context menu (right-mouse button) on the Git Repositories view and paste in the clipboard URL to start the EGit wizard
  • Accept the default values in the subsequent Source Git Repository and Branch Selection dialogs
  • In the final Local Destination dialog, select an appropriate local directory using the Browse button, then click Finish
  • Activate the context menu on the new local repository view entry, and select Import Projects...
  • Accept the defaults in the import wizard dialogs by clicking Next and Finish
  • The newly imported project should now build automatically, first downloading the necessary dependencies, then compiling the source code
  • It is also useful to have a Maven run configuration to build the project, as follows:
  • Activate the context menu on the newly imported project, then select Run As → Run Configurations... → Maven Build → New
  • Name the configuration spider build and set the goals as clean verify scala:doc, and in the Refresh tab tick the resources checkbox
  • Run the new configuration to build the application; the project should auto-build from now on
  • To run the application, similarly activate the context menu and select Run As → Run Configurations... → Scala Application → New
  • Name the new configuration spider run and set the main class as web.satyagraha.spider.app.SpiderApp; in the Arguments tab, add command line options (see below) and a URL for the web site to be scanned
  • If you wish to investigate the codebase further, you will find it useful to expand the Maven Dependencies folder, select one or more jars, then activate the context menu and select Maven → Download JavaDoc (and Sources too if required) - this will provide better context help

Non-Eclipse Users

  • Ensure you have Git and Maven installed on your system
  • Copy the URL in the Git Read-Only entry field at the top of this web page to the clipboard
  • Change working directory to an appropriate location for the checkout, then execute: git clone url
  • Change working directory to the newly created spider subdirectory
  • Execute: mvn clean verify scala:doc

Command Line Invocation

Once successfully built, the application may be invoked stand-alone via the command:

java -jar target/spider-1.0-SNAPSHOT-jar-with-dependencies.jar [options] url

The available options are:

  • --cookies cookies - allows arbitrary cookies to be passed on the initial page GET, e.g. a session id captured via Firefox Live HTTP Headers
  • --loglevel level - allows SLF4J logging level to be set, e.g. to WARN
  • --readers count - allows number of reader actors to be varied from default of 10
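The options above can be modelled with a small recursive parser. The following is a minimal sketch, not the application's actual parsing code; the configuration class, its field names, and its defaults (other than the documented reader count of 10) are assumptions:

```scala
// Hypothetical sketch of parsing the command line options listed above.
case class SpiderConfig(
    cookies: Option[String] = None, // passed on the initial page GET
    logLevel: String = "INFO",      // SLF4J logging level
    readers: Int = 10,              // number of reader actors (documented default)
    url: Option[String] = None)     // web site to be scanned

def parseArgs(args: List[String], config: SpiderConfig = SpiderConfig()): SpiderConfig =
  args match {
    case "--cookies" :: value :: rest  => parseArgs(rest, config.copy(cookies = Some(value)))
    case "--loglevel" :: value :: rest => parseArgs(rest, config.copy(logLevel = value))
    case "--readers" :: value :: rest  => parseArgs(rest, config.copy(readers = value.toInt))
    case url :: rest                   => parseArgs(rest, config.copy(url = Some(url)))
    case Nil                           => config
  }
```

For example, `parseArgs(List("--readers", "20", "http://example.com"))` yields a configuration with 20 readers and the given URL.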

Implementation

Codebase

The codebase uses the following key components:

Principles of Operation

Essentially the application works by passing around References, which are simple objects encoding the relationship between a web page and a link within that page: whenever we encounter a link, a Reference is generated which may need to be followed. A number of types of actors handle the various roles in the processing, and the interactions between these actors are best shown in the following diagram:

[Diagram: interactions between the spider's actors]

The actors' responsibilities are as follows:

  • referenceActor - manages generation of References, which must be returned
  • targetActor - manages links previously seen
  • queuedActor - manages pool of readerActors
  • readerActor - reads and analyzes web page, returns itself to the queuedActor on completion
  • successActor - handles good links
  • failureActor - handles bad links
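The roles above can be illustrated with a much-simplified, synchronous sketch; the real application runs these roles as concurrent actors exchanging messages, and all names below are illustrative assumptions:

```scala
import scala.collection.mutable

case class Reference(page: String, link: String)

// targetActor's role: remember links previously seen, so each is fetched once.
class TargetRegistry {
  private val seen = mutable.Set[String]()
  def firstVisit(link: String): Boolean = seen.add(link)
}

// queuedActor's role: manage a pool of readers; a reader returns itself
// to the pool on completion.
class ReaderPool(size: Int) {
  private val idle = mutable.Queue.fill(size)(())
  def acquire(): Boolean = if (idle.nonEmpty) { idle.dequeue(); true } else false
  def release(): Unit = idle.enqueue(())
}

// readerActor's role: fetch and classify a link; successActor and failureActor
// would receive the Right and Left results respectively.
def classify(ref: Reference, fetch: String => Boolean): Either[Reference, Reference] =
  if (fetch(ref.link)) Right(ref) else Left(ref)
```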

Scaladoc

The application's scaladoc will be found in target/site/scaladocs on completion of the Maven scaladoc action.

Notes

  • This application does not honour the robots.txt convention for spiders, and thus potentially can generate a high load on a website by traversing all its pages. This is particularly true if you set a high value for the --readers option. High loads can be unpopular and might lead to claims of Denial of Service or result in IP blocking, so be warned. The workload generated could in principle be throttled by restricting the number of readers and/or introducing a sleep period between HTTP requests. On the other hand, if you do actually want a load generation tool, this could be one approach.
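One way the throttling mentioned above could be added is a fixed delay wrapped around each request a reader makes; this is an assumption about a possible extension, not current behaviour:

```scala
// Crude rate limiting: sleep before evaluating the request thunk.
// A token-bucket scheme shared across readers would be fairer.
def throttled[A](delayMillis: Long)(request: => A): A = {
  Thread.sleep(delayMillis)
  request
}
```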

  • The application does not attempt to fetch references defined in the HEAD section of an HTML document, e.g. CSS and JavaScript files; however, this should be a straightforward extension if required.
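The suggested extension might start from something like the following sketch, which collects CSS and JavaScript references; the regex is illustrative only and not the parsing approach the application actually uses:

```scala
// Match href/src attributes pointing at .css or .js assets.
val headAssetPattern = """(?:href|src)="([^"]+\.(?:css|js))"""".r

def headReferences(html: String): List[String] =
  headAssetPattern.findAllMatchIn(html).map(_.group(1)).toList
```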

  • Websites providing open-ended dynamic content links, like calendars in particular, may well result in non-termination of the application. Adding some kind of pattern match exclusion when determining whether links should be followed would most likely be the solution here.
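The pattern-match exclusion suggested above could look like this sketch; the exclusion patterns shown are examples, not shipped behaviour:

```scala
// Skip links matching any configured exclusion pattern before following them.
val exclusions = List(""".*/calendar/.*""".r, """.*\?date=.*""".r)

def shouldFollow(link: String): Boolean =
  !exclusions.exists(_.pattern.matcher(link).matches)
```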

  • A wide variety of errors will be seen when running the application against typical commercial websites. This is the reality of web content as seen in the wild! Naturally fixes to accommodate such anomalies are most welcome via the usual Github lifecycle, with corresponding unit tests being advisable to validate the changes.
