ShihabYasin / concurrent-webscrapping-python

Web scraping with Python and Selenium, comparing synchronous, concurrent, parallel, and asyncio versions.


How to run:

  1. Prepare a virtual environment.
$ python -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
  2. Install ChromeDriver globally, so that the chromedriver executable is on your PATH.

  3. Run the different scrapers:

# synchronous; makes 20 requests to https://en.wikipedia.org/wiki/Special:Random
(env)$ python script.py headless

# parallel with multiprocessing
(env)$ python script_parallel_1.py headless

# parallel with concurrent.futures
(env)$ python script_parallel_2.py headless

# concurrent with concurrent.futures (in theory, this should be the fastest)
(env)$ python script_concurrent.py headless

# parallel with concurrent.futures and concurrent with asyncio
(env)$ python script_asyncio.py headless
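
As a rough illustration of why the variants above differ in speed, here is a minimal sketch that swaps the Selenium page visit for a simulated wait. Everything in it (`fake_scrape`, `PAGES`, `LATENCY`, the `run_*` helpers) is hypothetical and not taken from the repository's scripts. The asyncio version uses a thread pool where the repository's script is described as using a parallel executor, purely to keep the example small and portable.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

PAGES = 8        # hypothetical workload size (the scripts above make 20 requests)
LATENCY = 0.05   # simulated network wait per page, in seconds


def fake_scrape(n: int) -> str:
    """Stand-in for one Selenium page visit: waits, then returns a label."""
    time.sleep(LATENCY)
    return f"page-{n}"


def run_sync() -> list:
    # One page after another, so the waits add up.
    return [fake_scrape(n) for n in range(1, PAGES + 1)]


def run_threads() -> list:
    # Threads overlap the waiting; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=PAGES) as pool:
        return list(pool.map(fake_scrape, range(1, PAGES + 1)))


async def run_asyncio() -> list:
    # The event loop hands each blocking call to an executor and awaits them together.
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=PAGES) as pool:
        futures = [
            loop.run_in_executor(pool, fake_scrape, n)
            for n in range(1, PAGES + 1)
        ]
        return list(await asyncio.gather(*futures))


def timed(fn):
    """Call fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    variants = [
        ("sync", run_sync),
        ("threads", run_threads),
        ("asyncio", lambda: asyncio.run(run_asyncio())),
    ]
    for name, fn in variants:
        _, elapsed = timed(fn)
        print(f"{name}: {elapsed:.2f} s")
```

Because the simulated work only waits, the threaded and asyncio runs finish in roughly one wait period, while the synchronous run takes about `PAGES` times as long. Real browser sessions also use CPU and memory, so actual gains will be smaller.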

Running the scrapers produces output like the following:

(env)$ python script.py
Scraping Wikipedia #1 time(s)...
Scraping Wikipedia #2 time(s)...
...
Scraping Wikipedia #19 time(s)...
Scraping Wikipedia #20 time(s)...
Elapsed run time: 57.36561393737793 seconds
(env)$ python script_concurrent.py

Elapsed run time: 11.831077098846436 seconds
(env)$ python script_concurrent.py headless

Running in headless mode
Elapsed run time: 6.222846269607544 seconds
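
The optional `headless` argument and the elapsed-time line in the output above could be produced along these lines. This is a sketch under assumptions, not the repository's code: `is_headless`, `chrome_arguments`, and `main` are hypothetical names, and the scraping step itself is left out.

```python
import sys
import time


def is_headless(argv: list) -> bool:
    """Return True when the first argument after the script name is 'headless'."""
    return len(argv) > 1 and argv[1] == "headless"


def chrome_arguments(headless: bool) -> list:
    """Chrome switches to start the browser with (assumed minimal set).

    With Selenium, each string would typically be passed to
    Options.add_argument() before creating the Chrome driver.
    """
    args = []
    if headless:
        args.append("--headless")
    return args


def main(argv: list) -> float:
    headless = is_headless(argv)
    if headless:
        print("Running in headless mode")

    start = time.time()
    # The scraping work would run here, using chrome_arguments(headless).
    elapsed = time.time() - start

    print(f"Elapsed run time: {elapsed} seconds")
    return elapsed


if __name__ == "__main__":
    main(sys.argv)
```

Headless mode skips rendering a visible browser window, which is consistent with the shorter run time shown above for the `headless` invocation.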
  4. Run the tests:
(env)$ python -m pytest test/test_scraper_mock.py
(env)$ python -m pytest test/test_scraper.py

The tests produce output like the following:

(env)$ python -m pytest test/test_scraper_mock.py

================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items

test/test_scraper_mock.py ...                                                  [100%]

================================= 3 passed in 0.27s =================================
(env)$ python -m pytest test/test_scraper.py

================================ test session starts =================================
platform darwin -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /Users/michael/repos/testdriven/async-web-scraping
collected 3 items

test/test_scraper.py ...                                                       [100%]

================================= 3 passed in 0.19s =================================
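
For readers unfamiliar with mock-based tests such as the ones in `test/test_scraper_mock.py`, the general idea can be sketched as follows. The functions `extract_title` and `scrape_title` are hypothetical stand-ins, not the repository's scraper API. The point is only that the network call is replaced by a `Mock`, so the test runs without a browser or an internet connection.

```python
import re
from unittest.mock import Mock


def extract_title(html: str) -> str:
    """Return the text of the first <title> element, or '' if none is found."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def scrape_title(fetch, url: str) -> str:
    """Fetch a page through the injected `fetch` callable and return its title."""
    return extract_title(fetch(url))


def test_scrape_title_uses_fetch():
    url = "https://en.wikipedia.org/wiki/Special:Random"
    # The mock stands in for the real page download.
    fake_fetch = Mock(
        return_value="<html><head><title>Example - Wikipedia</title></head></html>"
    )

    title = scrape_title(fake_fetch, url)

    assert title == "Example - Wikipedia"
    fake_fetch.assert_called_once_with(url)
```

Saved in a file whose name starts with `test_`, this would be collected by `python -m pytest` like the tests above. Mocked tests tend to run faster and more reliably than tests that hit the live site, at the cost of not catching changes in the real page.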

License: MIT License


Languages: Python 96.8%, Shell 3.2%