Dependencies:
- Docker v20.10.13
- Python v3.10.4
- Selenium v4.1.3
Start by cloning down the base project with the web scraping script, create and activate a virtual environment, and install the dependencies:
$ git clone https://github.com/ShihabYasin/selenium-with-docker-swarm.git --branch base --single-branch
$ cd selenium-with-docker-swarm
$ python3.10 -m venv env
$ source env/bin/activate
(env)$ pip install -r requirements.txt
The above commands may differ depending on your environment.
Test out the scraper:
(env)$ python project/script.py
You should see something similar to:
Scraping random Wikipedia page...
[
{
'url': 'https://en.wikipedia.org/wiki/Andreas_Reinke',
'title': 'Andreas Reinke',
'last_modified': ' This page was last edited on 10 January 2022, at 23:11\xa0(UTC).'
}
]
Finished!
Essentially, the script requests a random article from Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML.
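The parsing half of that flow can be sketched on its own: once Selenium has rendered the page, Beautiful Soup pulls the title and last-modified line out of the HTML. The function and element IDs below are illustrative (Wikipedia's `firstHeading` and `footer-info-lastmod` IDs), not the exact helpers in project/scrapers/scraper.py:

```python
from bs4 import BeautifulSoup


def parse_article(url, page_source):
    # Extract the fields the scraper prints, given the rendered HTML
    soup = BeautifulSoup(page_source, "html.parser")
    return {
        "url": url,
        "title": soup.find(id="firstHeading").get_text(strip=True),
        "last_modified": soup.find(id="footer-info-lastmod").get_text(),
    }


sample = """
<html><body>
  <h1 id="firstHeading">Andreas Reinke</h1>
  <li id="footer-info-lastmod"> This page was last edited on 10 January 2022.</li>
</body></html>
"""
print(parse_article("https://en.wikipedia.org/wiki/Andreas_Reinke", sample))
```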
Next, let's spin up Selenium Grid to simplify the running of the script in parallel on multiple machines. We'll also use Docker and Docker Compose to manage those machines with minimal installation and configuration.
Add a docker-compose.yml file to the root directory:
version: '3.8'

services:

  hub:
    image: selenium/hub:4.1.3
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444

  chrome:
    image: selenium/node-chrome:4.1.3
    depends_on:
      - hub
    environment:
      - SE_EVENT_BUS_HOST=hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
Here, we used the official Selenium Docker images to set up a basic Selenium Grid that consists of a hub and a single Chrome node. We used the 4.1.3 tag, which pins specific versions of Selenium, WebDriver, Chrome, and Firefox.
Want to use different versions? Find the appropriate tag from the releases page.
Pull and run the images:
$ docker-compose up -d
Navigate to http://localhost:4444 in your browser to ensure that the hub is up and running with one Chrome node:
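Besides the browser dashboard, Selenium 4's Grid exposes its state as JSON at the `/status` endpoint, which is handy for scripting a readiness check. Here's a stdlib-only sketch (function names are my own, not part of the tutorial's project):

```python
import json
from urllib.request import urlopen


def grid_ready(status):
    # True when the Grid reports ready and at least one node is registered
    # (assumes the Selenium 4 /status payload shape:
    #  {"value": {"ready": bool, "nodes": [...]}})
    value = status.get("value", {})
    return bool(value.get("ready")) and len(value.get("nodes", [])) > 0


def check_grid(url="http://localhost:4444/status"):
    # Fetch and evaluate the live status of a running Grid
    with urlopen(url) as resp:
        return grid_ready(json.load(resp))
```

With the compose stack up, `check_grid()` should return True once the Chrome node has registered with the hub.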
Since Selenium Hub is running on a different machine (within the Docker container), we need to configure the remote driver in project/scrapers/scraper.py:
def get_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    # initialize driver
    driver = webdriver.Remote(
        command_executor='http://localhost:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME
    )
    return driver
Add the import:
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
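One practical wrinkle: right after `docker-compose up`, the hub can take a few seconds before it accepts sessions, so the first `webdriver.Remote` call may fail with a connection error. A small retry wrapper around `get_driver` (a hypothetical helper, not part of the tutorial's script) smooths that over:

```python
import time


def connect_with_retry(factory, retries=5, delay=2):
    # Call a driver factory (e.g., get_driver) until it succeeds,
    # sleeping `delay` seconds between attempts
    last_exc = None
    for _ in range(retries):
        try:
            return factory()
        except Exception as exc:  # e.g., connection refused while Grid boots
            last_exc = exc
            time.sleep(delay)
    raise last_exc
```

Usage would be `driver = connect_with_retry(get_driver)` in place of a bare `get_driver()` call.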
Run the scraper again:
(env)$ python project/script.py
While the scraper is running, you should see the "Sessions" count on the Grid dashboard change to one, indicating that the Chrome node is in use:
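The same `/status` payload used above also exposes session usage, so you can watch the Grid programmatically instead of refreshing the dashboard. A sketch that counts busy slots (the payload shape is an assumption based on the Selenium 4 status API, where a slot's "session" field is null when idle):

```python
def active_sessions(status):
    # Count slots across all nodes that currently hold a session
    count = 0
    for node in status.get("value", {}).get("nodes", []):
        for slot in node.get("slots", []):
            if slot.get("session"):
                count += 1
    return count
```

While the scraper runs, feeding this function the JSON from http://localhost:4444/status should return 1.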