pombredanne / crawlit

Python web crawler with limitations

=======
crawlit
=======

Python web crawler with limitations.

Installation

  • $ git clone https://github.com/kracekumar/crawlit.git
  • $ cd crawlit
  • $ sudo python setup.py install
    (alternatively: $ pip install -r requirements)

Usage:

Crawl python.org

  • $ crawlit http://python.org

A new directory is created and all downloaded HTML files are saved into it.

Crawl only 2000 pages from python.org

  • $ crawlit http://python.org --count 2000
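How crawlit enforces the page limit internally is not described here. As a rough illustration only, a page cap can be applied to a breadth-first crawl loop as sketched below. The `crawl` function and the `get_links` callable are hypothetical names introduced for this example and are not part of crawlit.

```python
from collections import deque


def crawl(seed, get_links, count=None):
    """Visit pages breadth-first and stop after `count` pages.

    `get_links` is a hypothetical callable that returns the links
    found on a page; `count=None` means no limit.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and (count is None or len(visited) < count):
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited
```

With an in-memory link graph standing in for real pages, `crawl("a", lambda u: graph.get(u, []), count=2)` stops after two pages even when more links are queued.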

Features

  • Single-threaded
  • Automatic recovery of the crawler
  • Obeys robots.txt rules
  • Crawls links from the same domain
  • Downloads only HTML files
  • Uses the requests stream option, so only the headers are fetched up front and the body is fetched when needed
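The sketch below shows one way the same-domain, robots.txt, and HTML-only checks listed above could be expressed. It is an assumption-laden illustration, not crawlit's actual code: the helper names (`same_domain`, `build_robots`, `is_html`, `fetch_html`) and the `crawlit` user-agent string are hypothetical, and "same domain" is taken to mean an identical host name. The `http_get` argument is expected to behave like `requests.get`, which, when called with `stream=True`, returns once the headers arrive and downloads the body only when it is read.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def same_domain(seed_url, link):
    """True when link points at the same host as seed_url (assumed meaning)."""
    return urlparse(seed_url).netloc == urlparse(link).netloc


def build_robots(robots_lines):
    """Parse robots.txt lines that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def is_html(content_type):
    """True for header values such as 'text/html; charset=utf-8'."""
    return content_type.split(";")[0].strip().lower() == "text/html"


def fetch_html(url, http_get):
    """Return the page body only if the headers say it is HTML.

    http_get is assumed to work like requests.get: with stream=True
    the body is not downloaded until .text is read.
    """
    response = http_get(url, stream=True)
    if not is_html(response.headers.get("Content-Type", "")):
        response.close()  # release the connection without reading the body
        return None
    return response.text
```

A crawler could then call `fetch_html(link, requests.get)` only for links where both `same_domain(seed, link)` and `robots.can_fetch("crawlit", link)` are true.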

TODO

  • Add multiprocessing support for crawling URLs from multiple domains

License: BSD 3-Clause "New" or "Revised" License