jdevelop / webspider

Open WEB spider platform

webspider

Open WEB spider platform. It uses Akka Cluster with Distributed PubSub for distributed processing.

The webspider-demo module contains a simple web application that starts one task scheduler node and a couple of web processing nodes, and exposes its interface at http://localhost:8080/

Planned features

  • extract text from HTML/PDF documents
  • process only documents that match given patterns in names or content types
  • extract data using XPath expressions from XHTML pages or from HTML pages that are not well-formed
  • maintain a website graph (links between ancestor and successor pages)
  • process websites behind authentication (HTTP Basic/Digest, form-based authentication)
  • handle failures and restart processing from the point where the application was aborted
  • provide an extension API for document type handlers and protocol handlers
  • process website pages concurrently
  • minimize traffic by using bzip/gzip encoding when possible and by avoiding downloading the same link more than once
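
The last planned feature, not downloading the same link twice, could be approached roughly as below. This is only a minimal sketch and not code from webspider: the `VisitedLinks` class and its methods are hypothetical names, and the normalization rules (lowercase scheme and host, resolve `.`/`..` segments, drop the fragment) are an assumption about what counts as "the same link".

```scala
import java.net.URI
import scala.collection.mutable

// Hypothetical helper, not part of webspider: remembers links that were already seen.
final class VisitedLinks {
  private val seen = mutable.HashSet.empty[String]

  // Normalizes a link so that trivially different spellings compare equal:
  // resolves "." and ".." path segments, lowercases scheme and host,
  // treats an empty path as "/", and drops the fragment.
  def normalize(link: String): String = {
    val uri    = new URI(link).normalize()
    val scheme = Option(uri.getScheme).map(_.toLowerCase).orNull
    val host   = Option(uri.getHost).map(_.toLowerCase).orNull
    val path   = Option(uri.getPath).filter(_.nonEmpty).getOrElse("/")
    new URI(scheme, uri.getUserInfo, host, uri.getPort, path, uri.getQuery, null).toString
  }

  // Returns true only the first time a (normalized) link is offered.
  def markIfNew(link: String): Boolean = seen.add(normalize(link))
}
```

A crawler loop would call `markIfNew` before scheduling a download and skip the link when it returns `false`. In a clustered setup the set would have to be shared or partitioned between nodes, which this single-process sketch does not address.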

Supported protocols:

  • HTTP(S)

Languages

Scala 97.3%, HTML 2.0%, JavaScript 0.4%, CSS 0.3%