amking / spider

Scala Website Spider

Overview

This project implements a basic web spider in Scala, demonstrating in particular the use of Actors. The application's primary purpose is to traverse a website looking for broken internal or external links, which it reports at the end of processing. However, it has been designed so it can be extended and repurposed for related applications as required.

Getting Started

Eclipse Users

  • Ensure you have the Eclipse EGit and Scala IDE plugins installed
  • Copy the URL in the Git Read-Only entry field at the top of this web page to the clipboard
  • In Eclipse, execute Window → Show View → Other... → Git → Git Repositories to make this view visible
  • Activate the context menu (right-mouse button) on the Git Repositories view and paste in the clipboard URL to start the EGit wizard
  • Accept the default values in the subsequent Source Git Repository and Branch Selection dialogs
  • In the final Local Destination dialog, select an appropriate local directory using the Browse button, then click Finish
  • Activate the context menu on the new local repository view entry, and select Import Projects...
  • Accept the defaults in the import wizard dialogs by clicking Next and Finish
  • The newly imported project should now build automatically, first downloading the necessary dependencies, then compiling the source code
  • It is also useful to have a Maven run configuration to build the project, as follows:
  • Activate the context menu on the newly imported project, then select Run As → Run Configurations... → Maven Build → New
  • Name the configuration spider build and set the goals as clean verify scala:doc, and in the Refresh tab tick the resources checkbox
  • Run the new configuration to build the application; the project should auto-build from now on
  • To run the application, similarly activate the context menu and select Run As → Run Configurations... → Scala Application → New
  • Name the new configuration spider run and set the main class as web.satyagraha.spider.app.SpiderApp; in the Arguments tab, add command line options (see below) and a URL for the web site to be scanned
  • If you wish to investigate the codebase further, you will find it useful to expand the Maven Dependencies folder, select one or more jars, then activate the context menu and select Maven → Download JavaDoc (and Sources too if required) - this will provide better context help

Non-Eclipse Users

  • Ensure you have Git and Maven installed on your system
  • Copy the URL in the Git Read-Only entry field at the top of this web page to the clipboard
  • Change working directory to an appropriate location for the checkout, then execute: git clone url
  • Change working directory to the newly created spider subdirectory
  • Execute: mvn clean verify scala:doc

Command Line Invocation

Once successfully built, the application may be invoked stand-alone via the command:

java -jar target/spider-1.0-SNAPSHOT-jar-with-dependencies.jar [options] url

The available options are:

  • --cookies cookies - allows arbitrary cookies to be passed on the initial page GET, e.g. a session id captured via Firefox Live HTTP Headers
  • --loglevel level - allows SLF4J logging level to be set, e.g. to WARN
  • --readers count - allows number of reader actors to be varied from default of 10
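The options above can be modelled with a small recursive parser. The following is a minimal sketch, not the application's actual parsing code; the configuration class, its field names, and its defaults (other than the documented reader count of 10) are assumptions:

```scala
// Hypothetical sketch of parsing the command line options listed above.
case class SpiderConfig(
    cookies: Option[String] = None, // passed on the initial page GET
    logLevel: String = "INFO",      // SLF4J logging level
    readers: Int = 10,              // number of reader actors (documented default)
    url: Option[String] = None)     // web site to be scanned

def parseArgs(args: List[String], config: SpiderConfig = SpiderConfig()): SpiderConfig =
  args match {
    case "--cookies" :: value :: rest  => parseArgs(rest, config.copy(cookies = Some(value)))
    case "--loglevel" :: value :: rest => parseArgs(rest, config.copy(logLevel = value))
    case "--readers" :: value :: rest  => parseArgs(rest, config.copy(readers = value.toInt))
    case url :: rest                   => parseArgs(rest, config.copy(url = Some(url)))
    case Nil                           => config
  }
```

For example, `parseArgs(List("--readers", "20", "http://example.com"))` yields a configuration with 20 readers and the given URL.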

Implementation

Codebase

The codebase uses the following key components:

Principles of Operation

Essentially the application works by passing around References, which are simple objects encoding the relationship between a web page and a link within that page: whenever we encounter a link, a Reference is generated which may need to be followed. A number of types of actors handle the various roles in the processing, and the interactions between these actors are best shown in the following diagram:

[Diagram: interactions between the spider's actors]

The actors' responsibilities are as follows:

  • referenceActor - manages generation of References, which must be returned
  • targetActor - manages links previously seen
  • queuedActor - manages pool of readerActors
  • readerActor - reads and analyzes web page, returns itself to the queuedActor on completion
  • successActor - handles good links
  • failureActor - handles bad links
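The roles above can be illustrated with a much-simplified, synchronous sketch; the real application runs these roles as concurrent actors exchanging messages, and all names below are illustrative assumptions:

```scala
import scala.collection.mutable

case class Reference(page: String, link: String)

// targetActor's role: remember links previously seen, so each is fetched once.
class TargetRegistry {
  private val seen = mutable.Set[String]()
  def firstVisit(link: String): Boolean = seen.add(link)
}

// queuedActor's role: manage a pool of readers; a reader returns itself
// to the pool on completion.
class ReaderPool(size: Int) {
  private val idle = mutable.Queue.fill(size)(())
  def acquire(): Boolean = if (idle.nonEmpty) { idle.dequeue(); true } else false
  def release(): Unit = idle.enqueue(())
}

// readerActor's role: fetch and classify a link; successActor and failureActor
// would receive the Right and Left results respectively.
def classify(ref: Reference, fetch: String => Boolean): Either[Reference, Reference] =
  if (fetch(ref.link)) Right(ref) else Left(ref)
```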

Scaladoc

The application's scaladoc will be found in target/site/scaladocs on completion of the Maven scaladoc action.

Notes

  • This application does not honour the robots.txt convention for spiders, and thus potentially can generate a high load on a website by traversing all its pages. This is particularly true if you set a high value for the --readers option. High loads can be unpopular and might lead to claims of Denial of Service or result in IP blocking, so be warned. The workload generated could in principle be throttled by restricting the number of readers and/or introducing a sleep period between HTTP requests. On the other hand, if you do actually want a load generation tool, this could be one approach.
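One way the throttling mentioned above could be added is a fixed delay wrapped around each request a reader makes; this is an assumption about a possible extension, not current behaviour:

```scala
// Crude rate limiting: sleep before evaluating the request thunk.
// A token-bucket scheme shared across readers would be fairer.
def throttled[A](delayMillis: Long)(request: => A): A = {
  Thread.sleep(delayMillis)
  request
}
```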

  • The application does not attempt to fetch references defined in the HEAD section of an HTML document, e.g. CSS and JavaScript files; however, this should be a straightforward extension if required.
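The suggested extension might start from something like the following sketch, which collects CSS and JavaScript references; the regex is illustrative only and not the parsing approach the application actually uses:

```scala
// Match href/src attributes pointing at .css or .js assets.
val headAssetPattern = """(?:href|src)="([^"]+\.(?:css|js))"""".r

def headReferences(html: String): List[String] =
  headAssetPattern.findAllMatchIn(html).map(_.group(1)).toList
```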

  • Websites providing open-ended dynamic content links, like calendars in particular, may well result in non-termination of the application. Adding some kind of pattern match exclusion when determining whether links should be followed would most likely be the solution here.
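The pattern-match exclusion suggested above could look like this sketch; the exclusion patterns shown are examples, not shipped behaviour:

```scala
// Skip links matching any configured exclusion pattern before following them.
val exclusions = List(""".*/calendar/.*""".r, """.*\?date=.*""".r)

def shouldFollow(link: String): Boolean =
  !exclusions.exists(_.pattern.matcher(link).matches)
```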

  • A wide variety of errors will be seen when running the application against typical commercial websites. This is the reality of web content as seen in the wild! Naturally fixes to accommodate such anomalies are most welcome via the usual Github lifecycle, with corresponding unit tests being advisable to validate the changes.
