Sadhanandh / Workaholic-WebCrawler

A Multi-clustered and Multi-threaded Web Crawler

Workaholic Webcrawler

An extensible multi-threaded web crawler, useful in a multi-node system where the DB is centralized or clustered.

Features:

  • Uses an SQLite database.

  • Batch processing of DB queries.

  • Efficiently uses two queues (see the queue sketch below):

    • raw queue -> which the slaves pull URLs out of.
    • final queue -> which the manager pulls discovered links out of and verifies against the DB.
  • Each node has 1 manager and x slaves.

  • The manager queries the central database.

  • The database also stores "backlinks" -> the number of links pointing to a particular page (see the DB sketch below).

  • The User-Agent header is customizable.

  • Switch between urllib2 (good SOCKS support) and the requests library (see the fetch sketch below).

  • Uses three mutex locks to keep shared state thread-safe.
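
Below is a minimal sketch of the two-queue manager/slave pattern described above, using Python's threading and queue modules. The crawl_page stub, the in-memory seen set (standing in for the central DB), and the slave count are illustrative assumptions, not the project's actual code:

    import queue
    import threading

    raw_q = queue.Queue()    # frontier: URLs that the slaves pull out and fetch
    final_q = queue.Queue()  # discovered links that the manager pulls out and verifies

    seen = set()                  # stand-in for the central DB
    seen_lock = threading.Lock()  # one of the mutexes guarding shared state

    def crawl_page(url):
        # Stub for the real fetch-and-parse step; a real slave would download
        # `url` and return the links found on the page.
        return []

    def slave():
        while True:
            url = raw_q.get()
            try:
                for link in crawl_page(url):
                    final_q.put(link)
            finally:
                raw_q.task_done()

    def manager():
        while True:
            link = final_q.get()
            try:
                with seen_lock:
                    is_new = link not in seen
                    seen.add(link)
                if is_new:
                    raw_q.put(link)   # only unseen links re-enter the frontier
            finally:
                final_q.task_done()

    for _ in range(4):                # "x slaves" per node; 4 is an arbitrary example
        threading.Thread(target=slave, daemon=True).start()
    threading.Thread(target=manager, daemon=True).start()

    raw_q.put("http://github.com")
    raw_q.join()                      # wait for the frontier to drain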
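A rough sketch of how the batched DB writes and backlink counts could look with sqlite3, run by the manager (the only thread that touches the database). The file name, table schema, and batch limit are assumptions, and the upsert syntax requires SQLite 3.24+:

    import sqlite3

    conn = sqlite3.connect("crawler.db")      # file name is an assumption
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " backlinks INTEGER DEFAULT 0"        # number of links pointing at this page
        ")"
    )

    BATCH_LIMIT = 50   # flush once this many links are collected (cf. the -b option)
    pending = []       # links waiting to be written in one batch

    def record_link(url):
        # Queue a discovered link; flush to the DB once the batch limit is hit.
        pending.append((url,))
        if len(pending) >= BATCH_LIMIT:
            flush()

    def flush():
        # Insert new URLs and bump backlink counts in a single transaction.
        with conn:
            conn.executemany(
                "INSERT INTO pages (url, backlinks) VALUES (?, 1) "
                "ON CONFLICT(url) DO UPDATE SET backlinks = backlinks + 1",
                pending,
            )
        pending.clear()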
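A hedged sketch of a fetch helper that prefers requests and falls back to urllib2 (urllib.request on Python 3), with a customizable User-Agent. The function name and the default header value are illustrative, not taken from the project:

    try:
        import requests
        HAVE_REQUESTS = True
    except ImportError:
        HAVE_REQUESTS = False
        try:
            import urllib2 as urlrequest           # Python 2, as in the original project
        except ImportError:
            import urllib.request as urlrequest    # Python 3 equivalent

    USER_AGENT = "Workaholic-WebCrawler/1.0"       # illustrative default; customizable

    def fetch(url, timeout=10):
        # Fetch a page body, preferring requests when it is installed.
        if HAVE_REQUESTS:
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
            return resp.text
        req = urlrequest.Request(url, headers={"User-Agent": USER_AGENT})
        return urlrequest.urlopen(req, timeout=timeout).read().decode("utf-8", "replace")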

Working:

(WebCrawler architecture diagram)

Requirements:

  • python-lxml (a regex search can be used instead; see the sketch below)

  • python-requests (optional)
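
As a rough illustration of the lxml-vs-regex choice, a link extractor might prefer lxml and fall back to a regex scan. The function name and regex are assumptions, not the project's code:

    import re

    LINK_RE = re.compile(r'href=["\'](http[^"\']+)["\']', re.IGNORECASE)

    def extract_links(html_text):
        # Prefer lxml's parser; fall back to a regex scan when lxml is unavailable.
        try:
            from lxml import html
            tree = html.fromstring(html_text)
            return [href for href in tree.xpath("//a/@href") if href.startswith("http")]
        except ImportError:
            return LINK_RE.findall(html_text)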

Usage:

git clone https://github.com/Sadhanandh/Workaholic-WebCrawler.git

If you don't have the lxml library already:
On Ubuntu/Debian:

sudo apt-get install python-lxml

or, on Windows/Mac/*nix:

pip install lxml

Then run the crawler with one or more seed URLs:

./webcrawl.py --urls "http://github.com http://twitter.com"

Other options:

-d / --depth   "depth of crawling"
-t / --threads "number of threads, i.e. the number of slaves per node"
-b / --batch   "batch query limit"
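
For example, a run with depth 2, four slaves per node, and a batch limit of 50 (all arbitrary values) could look like:

./webcrawl.py --urls "http://github.com http://twitter.com" -d 2 -t 4 -b 50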

TODO:

  • Use a multi-clustered MySQL DB server cooperating with a community of nodes.
  • Separate I/O-bound slaves (using the threading library) from CPU-bound slaves (using the multiprocessing library) to maximize throughput.
  • Respect robots.txt (DB)
  • Support multiple proxy IPs, with every page of a base URL crawled through the same proxy. (DB)
  • Dynamically create and terminate threads (slaves) based on queue idle time and system variables.
  • Implement a shareable queue between nodes and between I/O-bound and CPU-bound slaves.
  • Extend the above using load balancing / load sharing techniques.

CANDO:

  • Find all the links in the readable area (excluding headers, footers, and other parts of the page users usually ignore).
