dhet / scrawl

A distributed website crawler for analyzing website structures written in Scala

Scrawl

Scrawl is a highly parallel, distributed web scraper written in Scala. It is designed as a CLI that takes the base URI of a website as input and scrapes that website and all of its child pages recursively. To achieve a high level of concurrency, Scrawl employs Akka's actor system: each web document is scraped by a separate actor, which in turn spawns further actors for the links it encounters while scraping.
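The recursive fan-out described above can be illustrated with a minimal sketch. This is not Scrawl's actual code: to stay dependency-free it uses Futures from the Scala standard library instead of Akka actors, and it starts one concurrent task per link, which is an assumption about how the work is split. The in-memory `site` map, `linksOf`, `crawl` and `crawlBlocking` are hypothetical names that stand in for real HTTP fetching and link extraction.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object CrawlSketch {
  // Hypothetical in-memory "website": each page maps to the links it contains.
  val site: Map[String, List[String]] = Map(
    "/"    -> List("/a", "/b"),
    "/a"   -> List("/a/1"),
    "/b"   -> Nil,
    "/a/1" -> Nil
  )

  // Stand-in for fetching and parsing a document; a real crawler would do HTTP here.
  def linksOf(url: String): List[String] = site.getOrElse(url, Nil)

  // Scrape one page, then start one concurrent task per link it contains.
  // `depth` bounds the recursion: depth 0 visits only the given page.
  def crawl(url: String, depth: Int)(implicit ec: ExecutionContext): Future[Set[String]] =
    if (depth <= 0) Future.successful(Set(url))
    else
      Future(linksOf(url)).flatMap { links =>
        Future
          .sequence(links.map(crawl(_, depth - 1)))
          .map(children => children.foldLeft(Set(url))(_ ++ _))
      }

  // Convenience wrapper that blocks until the whole crawl has finished.
  def crawlBlocking(url: String, depth: Int): Set[String] = {
    import ExecutionContext.Implicits.global
    Await.result(crawl(url, depth), 10.seconds)
  }
}
```

Calling `CrawlSketch.crawlBlocking("/", 1)` visits the base page and its direct children. In an actor-based design, each `crawl` call would roughly correspond to one actor that handles a single document and creates child actors for its links.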

This project was created by David Hettler and Nico Hein as part of a semester project at LMU Munich.

The documentation is written in German and can be found here.

Usage

$ scrawl URL [PARAMETERS]

Parameters:

  • -level <level>, synonyms: -l, -d, -depth
    • Determines how many levels to crawl. Expects one argument of type Integer.
  • -out, synonyms: -o, -path, -dir
    • Specifies the directory in which to save the result. By default, the files are saved in the ./sitemaps directory. The names of the XML files correspond to the crawled sites.
  • -help, synonyms: -h, -?, -wat
    • Prints a help text.
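
As an illustration of how the parameters and their synonyms listed above could be handled, here is a minimal sketch of an argument parser. It is not taken from Scrawl's source: the `Settings` case class, its field names and the error messages are hypothetical, and it assumes that the -out flag is followed by a directory argument. No default is assumed for the crawl level, since none is documented.

```scala
import scala.annotation.tailrec
import scala.util.Try

object ArgsSketch {
  // Hypothetical settings holder; field names are illustrative, not Scrawl's own.
  final case class Settings(
    url: Option[String] = None,
    level: Option[Int] = None,   // no documented default, so left unset
    out: String = "./sitemaps",  // documented default output directory
    help: Boolean = false
  )

  // Each flag together with its documented synonyms.
  private val levelFlags = Set("-level", "-l", "-d", "-depth")
  private val outFlags   = Set("-out", "-o", "-path", "-dir")
  private val helpFlags  = Set("-help", "-h", "-?", "-wat")

  // Walk the argument list, consuming a flag (and its value, if any) at each step.
  @tailrec
  def parse(args: List[String], acc: Settings = Settings()): Either[String, Settings] =
    args match {
      case Nil => Right(acc)
      case flag :: value :: rest if levelFlags(flag) =>
        Try(value.toInt).toOption match {
          case Some(n) => parse(rest, acc.copy(level = Some(n)))
          case None    => Left(s"$flag expects an Integer, got '$value'")
        }
      case flag :: dir :: rest if outFlags(flag) => parse(rest, acc.copy(out = dir))
      case flag :: rest if helpFlags(flag)       => parse(rest, acc.copy(help = true))
      case flag :: _ if flag.startsWith("-")     => Left(s"Unknown or incomplete parameter: $flag")
      case url :: rest if acc.url.isEmpty        => parse(rest, acc.copy(url = Some(url)))
      case extra :: _                            => Left(s"Unexpected argument: $extra")
    }
}
```

For example, `ArgsSketch.parse(List("http://example.com", "-d", "3"))` yields a `Right` whose `level` is `Some(3)`, while a non-numeric level such as `-l abc` yields a `Left` with an error message.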


