A Python web crawler with a rich config file.
Before installing, you may want to run the unit tests:
python unittest/unittest_all.py
To install:
python setup.py install
Usage:
dingpa_crawl.py [config_file_name] [db_prefix] [shard] [total]
e.g.
dingpa_crawl.py test.conf test.db 1 10
This example command uses test.conf as the config file and saves data in test.db.10.1. The shard/total arguments control sharding: the command above crawls only URLs whose hash mod 10 equals 1.
The following is a sample config:
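dingpa's actual hash function isn't documented here, so as a rough sketch of how this kind of URL sharding typically works (the MD5 hash and the function names below are assumptions, not dingpa's real code):

```python
import hashlib

def shard_of(url, total):
    # Stable hash of the URL, reduced modulo the number of shards.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % total

def should_crawl(url, shard, total):
    # A crawler started with a given shard/total handles only its own slice,
    # so several instances can split one URL space without coordination.
    return shard_of(url, total) == shard
```

Because the hash is stable, every URL lands in exactly one of the `total` shards, and restarting a crawler keeps its slice unchanged.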
[edu]
url = http://zsb.bupt.edu.cn/
url = http://www.pku.edu.cn/
update = http://[a-z0-9]+.[a-z0-9]+.edu.cn/.*/
update = http://[a-z0-9]+.[a-z0-9]+.edu.cn/.*htm
update = http://[a-z0-9]+.[a-z0-9]+.edu.cn/.*html
[gov]
url = http://www.gov.cn
update = http://www.gov.cn/[a-z0-9]+/
Here, edu and gov are group names for pages. Each url line defines a seed URL, and each update line defines a regex rule that filters which pages to crawl.
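The update rules are plain regular expressions over URLs. A minimal sketch of how such rules can be applied, assuming the match is anchored over the whole URL (the sample config leaves its dots unescaped; they are escaped below, and the helper name is hypothetical):

```python
import re

# The three [edu] update rules from the sample config, with dots escaped.
EDU_RULES = [
    r"http://[a-z0-9]+\.[a-z0-9]+\.edu\.cn/.*/",
    r"http://[a-z0-9]+\.[a-z0-9]+\.edu\.cn/.*htm",
    r"http://[a-z0-9]+\.[a-z0-9]+\.edu\.cn/.*html",
]

def matches_rules(url, rules=EDU_RULES):
    # Crawl a discovered link only if it matches at least one update rule.
    return any(re.fullmatch(pattern, url) for pattern in rules)
```

With these rules, links under .edu.cn domains ending in a slash, .htm, or .html pass the filter, while anything outside the group's patterns is skipped.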
dingpa uses an embedded database, CodernityDB, to save downloaded data. We use an embedded database instead of the file system because storing pages as files generates many small files, which are hard to manage.
CodernityDB is a fast NoSQL embedded database for Python.