scrapy-simple-http-queue

A Scrapy plugin that uses simple-http-queue as the URL queue, enabling distributed crawling.

First, initialize the external libraries (simple-http-queue is vendored as a git submodule):

git submodule init
git submodule update

Then start simple-http-queue, passing it a persistence file and the port to listen on:

cd externals/simple-http-queue/simple_http_queue
python HttpQueue.py /tmp/queue.dat 8888

For a complete example, see run_example.sh.

The plugin is configured through the following settings in settings.py (an example configuration is sketched after this list):

HTTP_HOST: host of the queue server (default: localhost)
HTTP_PORT: port of the queue server (default: 8888)
SCHEDULER_PERSIST: whether the queue is kept between runs (default: True)
SCHEDULER_QUEUE_NAME: name of the queue (default: the name of the spider)
QUEUE_TYPE: FIFO (default) or LIFO
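
A minimal settings.py might look like the sketch below. The SCHEDULER dotted path is an assumption inferred from the repository name (check the plugin source for the actual module path), and queue.internal.example is a placeholder host:

# settings.py (sketch; verify the SCHEDULER path against the plugin source)
SCHEDULER = "scrapy_simple_http_queue.scheduler.Scheduler"  # assumed module path

HTTP_HOST = "queue.internal.example"  # machine running HttpQueue.py (default: localhost)
HTTP_PORT = 8888                      # port HttpQueue.py listens on (default: 8888)
SCHEDULER_PERSIST = True              # keep the queue between runs (default: True)
SCHEDULER_QUEUE_NAME = "my_spider"    # defaults to the spider's name
QUEUE_TYPE = "FIFO"                   # FIFO (breadth-first) or LIFO (depth-first)

With the queue server running and every worker pointing at the same HTTP_HOST and SCHEDULER_QUEUE_NAME, multiple instances of the same spider share one frontier, which is what makes the crawl distributed.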

Use FIFO for breadth-first crawling and LIFO for depth-first crawling.

LIFO generally consumes less memory: a depth-first frontier grows with the depth of the crawl rather than the breadth of each level, so the queue stays shorter. The difference is illustrated below.
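
The following standalone sketch (independent of the plugin, using an in-memory deque in place of the HTTP queue) shows how the two orderings traverse a small link graph:

from collections import deque

# Toy link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def crawl(start, queue_type):
    queue, seen, order = deque([start]), {start}, []
    while queue:
        # FIFO pops the oldest URL (breadth-first);
        # LIFO pops the newest (depth-first).
        page = queue.popleft() if queue_type == "FIFO" else queue.pop()
        order.append(page)
        for nxt in links[page]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

print(crawl("A", "FIFO"))  # ['A', 'B', 'C', 'D', 'E', 'F']: level by level
print(crawl("A", "LIFO"))  # ['A', 'C', 'F', 'B', 'E', 'D']: deepest branch first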
