nathanbrock / simple-site-crawler

A simple page crawler using Docker, Python and Celery.

Simple Site Crawler

Pre-requisites

You will need Docker, Docker Compose and Make installed on your host machine.

Getting started

At least three containers run at any given time: a RabbitMQ message broker, a MongoDB data store and a task worker. The crawler uses Celery, the distributed task queue library, whose value becomes more apparent when you scale up to multiple workers.
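
To give a feel for the moving parts, here is a minimal sketch of how the worker side might be wired together. The service hostnames, module layout and database names are assumptions for illustration, not the repo's actual code.

    from celery import Celery
    from pymongo import MongoClient
    import requests

    # Broker and data store addresses assume docker-compose service names
    # "rabbitmq" and "mongo"; adjust to match the actual setup.
    app = Celery("crawler", broker="amqp://guest:guest@rabbitmq//")
    mongo = MongoClient("mongodb://mongo:27017")

    @app.task
    def crawl(url):
        """Fetch a single page and store the response in MongoDB."""
        response = requests.get(url, timeout=10)
        mongo.crawler.pages.insert_one({
            "url": url,
            "status": response.status_code,
            "body": response.text,
        })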

To get started, run the following command:

make start_containers

Before starting the crawl you may wish to increase the number of worker containers, which defaults to one:

docker-compose scale worker=[NUMBER_OF_WORKERS]

Once things are up and running, you can start a crawl of all the URLs listed in the crawlable_urls.txt file.

make start_crawl
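
Under the hood this plausibly amounts to queuing one Celery task per URL. A rough sketch, assuming a crawl task like the one above (the tasks module name is hypothetical):

    from tasks import crawl  # hypothetical module holding the crawl task

    # Read URLs from crawlable_urls.txt and enqueue one task per URL.
    with open("crawlable_urls.txt") as f:
        for line in f:
            url = line.strip()
            if url:
                crawl.delay(url)  # dispatched via the RabbitMQ broker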

The results of the crawl are saved into MongoDB, which you can reach from your host at localhost:27018.
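
For example, inspecting the results from the host with pymongo (the database and collection names are not documented here, so list them first):

    from pymongo import MongoClient

    # 27018 on the host maps to the MongoDB container's default port.
    client = MongoClient("mongodb://localhost:27018")
    print(client.list_database_names())  # discover what the crawler wrote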

Thanks
