nathanbrock / simple-site-crawler

A simple page crawler using Docker, Python and Celery.

Simple Site Crawler

Pre-requisites

You will need Docker, Docker Compose and Make installed on your host machine.

Getting started

At least three containers run at any given time: a RabbitMQ message broker, a MongoDB data store and a task worker. The crawler uses Celery, the distributed task queue library, whose value becomes more apparent when you scale up to multiple workers.
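
To give a feel for the moving parts, here is a minimal sketch of how the worker side might be wired together. The service hostnames, module layout and database names are assumptions for illustration, not the repo's actual code.

    from celery import Celery
    from pymongo import MongoClient
    import requests

    # Broker and data store addresses assume docker-compose service names
    # "rabbitmq" and "mongo"; adjust to match the actual setup.
    app = Celery("crawler", broker="amqp://guest:guest@rabbitmq//")
    mongo = MongoClient("mongodb://mongo:27017")

    @app.task
    def crawl(url):
        """Fetch a single page and store the response in MongoDB."""
        response = requests.get(url, timeout=10)
        mongo.crawler.pages.insert_one({
            "url": url,
            "status": response.status_code,
            "body": response.text,
        })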

To get started, run the following command:

make start_containers

Before starting the crawl you may wish to increase the number of worker containers, which defaults to one:

docker-compose scale worker=[NUMBER_OF_WORKERS]

Once things are up and running, you can start a crawl of all the URLs listed in the crawlable_urls.txt file.

make start_crawl
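
Under the hood this plausibly amounts to queuing one Celery task per URL. A rough sketch, assuming a crawl task like the one above (the tasks module name is hypothetical):

    from tasks import crawl  # hypothetical module holding the crawl task

    # Read URLs from crawlable_urls.txt and enqueue one task per URL.
    with open("crawlable_urls.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                crawl.delay(url)  # dispatched via the RabbitMQ broker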

The results of the crawl are saved into MongoDB, which you can reach from your host at localhost:27018.
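
For example, inspecting the results from the host with pymongo (the database and collection names are not documented here, so list them first):

    from pymongo import MongoClient

    # 27018 on the host maps to the MongoDB container's default port.
    client = MongoClient("mongodb://localhost:27018")
    print(client.list_database_names())  # discover what the crawler wrote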

Thanks
