Concurrent Web Scraping with Selenium Grid and Docker Swarm

Dependencies:

  1. Docker v20.10.13
  2. Python v3.10.4
  3. Selenium v4.1.3

    First Things First

    Start by cloning down the base project with the web scraping script, then create and activate a virtual environment and install the dependencies:

    $ git clone https://github.com/ShihabYasin/selenium-with-docker-swarm.git --branch base --single-branch
    $ cd selenium-with-docker-swarm
    $ python3.10 -m venv env
    $ source env/bin/activate
    (env)$ pip install -r requirements.txt
    

    The above commands may differ depending on your environment.

    Test out the scraper:

    (env)$ python project/script.py
    

    You should see something similar to:

    Scraping random Wikipedia page...
    [
    {
    'url': 'https://en.wikipedia.org/wiki/Andreas_Reinke',
    'title': 'Andreas Reinke',
    'last_modified': ' This page was last edited on 10 January 2022, at 23:11\xa0(UTC).'
    }
    ]
    Finished!
    

    Essentially, the script makes a request to Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- to pull information about whichever random article it lands on, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML. A rough sketch of that flow is shown below.
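    The real logic lives in project/scrapers/scraper.py; the following is only a minimal sketch of the idea, assuming chromedriver is available on your PATH and relying on Wikipedia's firstHeading and footer-info-lastmod element IDs. The project's actual code may be organized differently:

    from bs4 import BeautifulSoup
    from selenium import webdriver


    def get_driver():
        # headless Chrome via a local ChromeDriver
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        return webdriver.Chrome(options=options)


    def scrape_random_article(driver):
        # Special:Random redirects to a random article
        driver.get("https://en.wikipedia.org/wiki/Special:Random")
        soup = BeautifulSoup(driver.page_source, "html.parser")
        return {
            "url": driver.current_url,
            "title": soup.find(id="firstHeading").text,
            "last_modified": soup.find(id="footer-info-lastmod").text,
        }


    if __name__ == "__main__":
        driver = get_driver()
        try:
            print(scrape_random_article(driver))
        finally:
            driver.quit()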

    Configuring Selenium Grid

    Next, let's spin up Selenium Grid to simplify the running of the script in parallel on multiple machines. We'll also use Docker and Docker Compose to manage those machines with minimal installation and configuration.

    Add a docker-compose.yml file to the root directory:

    version: '3.8'

    services:

      hub:
        image: selenium/hub:4.1.3
        ports:
          - 4442:4442
          - 4443:4443
          - 4444:4444

      chrome:
        image: selenium/node-chrome:4.1.3
        depends_on:
          - hub
        environment:
          - SE_EVENT_BUS_HOST=hub
          - SE_EVENT_BUS_PUBLISH_PORT=4442
          - SE_EVENT_BUS_SUBSCRIBE_PORT=4443

    Here, we used the official Selenium Docker images to set up a basic Selenium Grid that consists of a hub and a single Chrome node. We used the 4.1.3 tag, which is associated with the following versions of Selenium, WebDriver, Chrome, and Firefox:

  • Selenium: 4.1.3
  • Google Chrome: 99.0.4844.84
  • ChromeDriver: 99.0.4844.51
  • Mozilla Firefox: 98.0.2
  • Geckodriver: 0.30.0

    Want to use different versions? Find the appropriate tag from the releases page.

    Pull and run the images:

    $ docker-compose up -d
    

    Navigate to http://localhost:4444 in your browser to ensure that the hub is up and running with one Chrome node registered (the Grid console lists the available nodes and active sessions).
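    If you'd rather verify this from the terminal, the Grid also exposes a JSON status endpoint. Here's a small sketch using only the standard library; the exact shape of the response can vary between Grid versions, so treat the keys as assumptions:

    import json
    from urllib.request import urlopen

    # Query the Grid's status endpoint -- the same data the console page is built from.
    with urlopen("http://localhost:4444/wd/hub/status") as response:
        status = json.load(response)

    value = status["value"]
    print("Grid ready:", value.get("ready"))
    print("Nodes registered:", len(value.get("nodes", [])))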

    Since Selenium Hub is running on a different machine (within the Docker container), we need to configure the remote driver in project/scrapers/scraper.py:

    def get_driver():
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        # initialize driver
        driver = webdriver.Remote(
            command_executor='http://localhost:4444/wd/hub',
            desired_capabilities=DesiredCapabilities.CHROME)
        return driver

    Add the import:

    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    

    Run the scraper again:

    (env)$ python project/script.py
    

    While the scraper is running, you should see "Sessions" change to one on the Grid console, indicating that the Chrome node is in use.
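    Since the end goal is concurrent scraping, you can already exercise the Grid by opening several remote sessions at once. The sketch below uses concurrent.futures purely for illustration -- it is not the base project's script, which runs a single scrape -- and with only one Chrome node the sessions will simply queue, so scale the node first (e.g., docker-compose up -d --scale chrome=3):

    import concurrent.futures

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


    def scrape_one(_):
        # each call opens its own remote session on the Grid
        driver = webdriver.Remote(
            command_executor="http://localhost:4444/wd/hub",
            desired_capabilities=DesiredCapabilities.CHROME)
        try:
            driver.get("https://en.wikipedia.org/wiki/Special:Random")
            soup = BeautifulSoup(driver.page_source, "html.parser")
            return {"url": driver.current_url, "title": soup.find(id="firstHeading").text}
        finally:
            driver.quit()


    if __name__ == "__main__":
        # three scrapes, each in its own browser session
        with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
            for result in executor.map(scrape_one, range(3)):
                print(result)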

    License: MIT