
crawling-scale-up

Repository with the final code for the Mastering Web Scraping in Python: Scaling to Distributed Crawling blog post.

Installation

You will need Redis and Python 3 installed. After that, install the required libraries with pip and the Playwright browsers with npx.

pip install requests beautifulsoup4 playwright "celery[redis]"
npx playwright install

Execute

Configure the Redis connection in the repo file and the Celery broker in the tasks file.
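As a hedged sketch of what that configuration might look like (the file names `repo.py` and `tasks.py`, the host, port, and database number are all assumptions; adjust them to your setup):

```python
# repo.py (hypothetical name): Redis connection used to store
# visited and queued URLs. Host/port/db are assumptions.
import redis

connection = redis.Redis(host="127.0.0.1", port=6379, db=1)

# tasks.py (hypothetical name): Celery app pointed at the same
# Redis instance as its message broker.
from celery import Celery

app = Celery("tasks", broker="redis://127.0.0.1:6379/1")
```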

Start the Celery worker, then run the main script, which starts queueing pages to crawl.

celery -A tasks worker
python3 main.py 
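The core pattern behind the two commands above, queue URLs, deduplicate against a shared set of visited pages, and enqueue newly discovered links, can be sketched in plain Python. This stand-alone version replaces Redis sets and Celery tasks with an in-memory set and a deque, and `extract_links` is a hypothetical stand-in for the real page fetching and Beautiful Soup parsing:

```python
from collections import deque

def extract_links(url):
    # Stand-in for fetching a page and parsing its <a href> tags.
    fake_site = {
        "https://example.com/": ["https://example.com/a", "https://example.com/b"],
        "https://example.com/a": ["https://example.com/b"],
        "https://example.com/b": [],
    }
    return fake_site.get(url, [])

def crawl(start_url, max_pages=10):
    visited = set()             # a Redis set in the distributed version
    queue = deque([start_url])  # the Celery task queue in the distributed version
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:      # skip pages another "worker" already handled
            continue
        visited.add(url)
        for link in extract_links(url):
            if link not in visited:
                queue.append(link)
    return visited

print(sorted(crawl("https://example.com/")))
```

In the distributed version, the `visited` check and the queue live in Redis so that many Celery workers can share them; `max_pages` plays the same role as a crawl limit that stops workers from queueing indefinitely.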

Contributing

Pull requests are welcome. For significant changes, please open an issue first to discuss what you would like to change.

License

MIT

About

https://www.zenrows.com/blog/mastering-web-scraping-in-python-scaling-to-distributed-crawling

License: MIT License


Languages

HTML 98.9%, Python 0.6%, JavaScript 0.5%