Concurrent Web Scraping with Selenium Grid and Docker Swarm

Dependencies:

  1. Docker v20.10.13
  2. Python v3.10.4
  3. Selenium v4.1.3

    First Things First

    Start by cloning down the base project with the web scraping script, then create and activate a virtual environment and install the dependencies:

    $ git clone https://github.com/ShihabYasin/selenium-with-docker-swarm.git --branch base --single-branch
    $ cd selenium-with-docker-swarm
    $ python3.10 -m venv env
    $ source env/bin/activate
    (env)$ pip install -r requirements.txt
    

    The above commands may differ depending on your environment.

    Test out the scraper:

    (env)$ python project/script.py
    

    You should see something similar to:

    Scraping random Wikipedia page...
    [
    {
    'url': 'https://en.wikipedia.org/wiki/Andreas_Reinke',
    'title': 'Andreas Reinke',
    'last_modified': ' This page was last edited on 10 January 2022, at 23:11\xa0(UTC).'
    }
    ]
    Finished!
    

    Essentially, the script makes a request to Wikipedia:Random -- https://en.wikipedia.org/wiki/Special:Random -- to pull information about whichever random article it lands on, using Selenium to automate interaction with the site and Beautiful Soup to parse the HTML. A rough sketch of that flow is shown below.
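    The real logic lives in project/scrapers/scraper.py; the following is only a minimal sketch of the idea, assuming chromedriver is available on your PATH and relying on Wikipedia's firstHeading and footer-info-lastmod element IDs. The project's actual code may be organized differently:

    from bs4 import BeautifulSoup
    from selenium import webdriver


    def get_driver():
        # headless Chrome via a local ChromeDriver
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        return webdriver.Chrome(options=options)


    def scrape_random_article(driver):
        # Special:Random redirects to a random article
        driver.get("https://en.wikipedia.org/wiki/Special:Random")
        soup = BeautifulSoup(driver.page_source, "html.parser")
        return {
            "url": driver.current_url,
            "title": soup.find(id="firstHeading").text,
            "last_modified": soup.find(id="footer-info-lastmod").text,
        }


    if __name__ == "__main__":
        driver = get_driver()
        try:
            print(scrape_random_article(driver))
        finally:
            driver.quit()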

    Configuring Selenium Grid

    Next, let's spin up Selenium Grid to simplify the running of the script in parallel on multiple machines. We'll also use Docker and Docker Compose to manage those machines with minimal installation and configuration.

    Add a docker-compose.yml file to the root directory:

    version: '3.8'

    services:

      hub:
        image: selenium/hub:4.1.3
        ports:
          - 4442:4442
          - 4443:4443
          - 4444:4444

      chrome:
        image: selenium/node-chrome:4.1.3
        depends_on:
          - hub
        environment:
          - SE_EVENT_BUS_HOST=hub
          - SE_EVENT_BUS_PUBLISH_PORT=4442
          - SE_EVENT_BUS_SUBSCRIBE_PORT=4443

    Here, we used the official Selenium Docker images to set up a basic Selenium Grid that consists of a hub and a single Chrome node. We used the 4.1.3 tag, which is associated with the following versions of Selenium, WebDriver, Chrome, and Firefox:

  • Selenium: 4.1.3
  • Google Chrome: 99.0.4844.84
  • ChromeDriver: 99.0.4844.51
  • Mozilla Firefox: 98.0.2
  • Geckodriver: 0.30.0

    Want to use different versions? Find the appropriate tag from the releases page.

    Pull and run the images:

    $ docker-compose up -d
    

    Navigate to http://localhost:4444 in your browser to ensure that the hub is up and running with one Chrome node registered (the Grid console lists the available nodes and active sessions).
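    If you'd rather verify this from the terminal, the Grid also exposes a JSON status endpoint. Here's a small sketch using only the standard library; the exact shape of the response can vary between Grid versions, so treat the keys as assumptions:

    import json
    from urllib.request import urlopen

    # Query the Grid's status endpoint -- the same data the console page is built from.
    with urlopen("http://localhost:4444/wd/hub/status") as response:
        status = json.load(response)

    value = status["value"]
    print("Grid ready:", value.get("ready"))
    print("Nodes registered:", len(value.get("nodes", [])))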

    Since Selenium Hub is running on a different machine (within the Docker container), we need to configure the remote driver in project/scrapers/scraper.py:

    def get_driver():
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")

        # initialize driver
        driver = webdriver.Remote(
            command_executor='http://localhost:4444/wd/hub',
            desired_capabilities=DesiredCapabilities.CHROME)
        return driver

    Add the import:

    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
    

    Run the scraper again:

    (env)$ python project/script.py
    

    While the scraper is running, you should see "Sessions" change to one on the Grid console, indicating that the Chrome node is in use.
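    Since the end goal is concurrent scraping, you can already exercise the Grid by opening several remote sessions at once. The sketch below uses concurrent.futures purely for illustration -- it is not the base project's script, which runs a single scrape -- and with only one Chrome node the sessions will simply queue, so scale the node first (e.g., docker-compose up -d --scale chrome=3):

    import concurrent.futures

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


    def scrape_one(_):
        # each call opens its own remote session on the Grid
        driver = webdriver.Remote(
            command_executor="http://localhost:4444/wd/hub",
            desired_capabilities=DesiredCapabilities.CHROME)
        try:
            driver.get("https://en.wikipedia.org/wiki/Special:Random")
            soup = BeautifulSoup(driver.page_source, "html.parser")
            return {"url": driver.current_url, "title": soup.find(id="firstHeading").text}
        finally:
            driver.quit()


    if __name__ == "__main__":
        # three scrapes, each in its own browser session
        with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
            for result in executor.map(scrape_one, range(3)):
                print(result)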

    License: MIT