This project focuses on:
- 100% test coverage
- Speed
- Reasonable memory consumption
- SOCKS proxy support
- Runtime metrics API
Usage example:

```python
from urllib.parse import urlsplit

from crawler import Crawler, Request


class TestCrawler(Crawler):
    def task_generator(self):
        # Seed the queue; the 'page' tag routes responses to handler_page
        for host in ('yandex.ru', 'github.com'):
            yield Request('page', 'https://%s/' % host, meta={'host': host})

    def handler_page(self, req, res):
        title = res.xpath('//title').text(default='N/A')
        print('Title of [%s]: %s' % (req.url, title))
        # Collect links that point outside the host this request started from
        ext_urls = set()
        for elem in res.xpath('//a[@href]'):
            url = elem.attr('href')
            parts = urlsplit(url)
            if parts.netloc and req.meta['host'] not in parts.netloc:
                ext_urls.add(url)
        print('External URLs:')
        for url in ext_urls:
            print(' * %s' % url)


bot = TestCrawler(num_network_threads=10)
bot.run()
```
Quick start:

- Install the package: `pip install crawler`
- Create a project skeleton: `crawl_start_project <project_name>`
- `cd` into the new directory
- Run `make build`
That will build a virtualenv with everything you need to start using crawler. To activate the virtualenv, run `pipenv shell`.
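A typical setup session might look like this (the project name `mybot` is hypothetical):

```shell
pip install crawler
crawl_start_project mybot   # create the project skeleton
cd mybot
make build                  # build the virtualenv
pipenv shell                # activate the virtualenv
```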
Put your crawler code into the crawlers/ directory, then run it with `crawl <CrawlerClassName>`.
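For instance, here is a minimal sketch of a runnable crawler, assuming the same `Crawler`/`Request` API as the usage example above (the file name and class name are hypothetical):

```python
# crawlers/hello.py (hypothetical file name)
from crawler import Crawler, Request


class HelloCrawler(Crawler):
    def task_generator(self):
        # A single seed request tagged 'page', handled by handler_page
        yield Request('page', 'https://example.com/')

    def handler_page(self, req, res):
        # Print the title of the fetched page
        print(res.xpath('//title').text(default='N/A'))
```

You would then launch it with `crawl HelloCrawler`.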