Hamuul / scrape-challenge

Running:

Prerequisites:

  • Ensure you have already executed python3 get_xmls.py, or have otherwise obtained the necessary .xml files and placed them in the current directory. Then run python3 sel.py or python3 playwright_scraper.py and sit back for a few hours.

Scraping results are written to products.txt for Selenium and pw/full.csv for Playwright.

Option 1: Docker container

  1. Download the Dockerfile and build the image:
    wget https://raw.githubusercontent.com/Hamuul/scrape-challenge/master/Dockerfile
    docker buildx build --network=host --tag rewe_scraper .
    

Option 2: Native

  1. Clone the repo:
    git clone https://github.com/Hamuul/scrape-challenge.git
  2. Install dependencies (the system-package example is for apt-based distros):
    sudo apt install libxml2-dev libxslt-dev -y
    pip install -r requirements.txt
    playwright install chromium && playwright install-deps
    

Initial mapping/analysis of the problem:

  • After some digging we find that http://shop.rewe.de exposes sitemaps at /sitemaps. We notice that there are two XML files which contain most/all products. However, when trying to use cURL with various user-agents and headers to disguise our scraper, we are faced with 403 Forbidden status codes.

  • Although they would greatly improve performance, aiohttp, requests, httpx and other popular Python HTTP clients also receive 403 when trying to fetch the HTML source. Furthermore, if the site suspects programmatic access of any kind, it serves a captcha, which is a significant hindrance to scraping.

  • Some searching reveals that there is an API available at https://shop.rewe.de/api/products?market=<market_id>&objectsPerPage=<num>, but using it would defeat the purpose of a scraper. It might also rate-limit scripts, though I have not tested that.

Scraping product details

  • We use the lxml package because it generally offers the best parsing performance in Python.
  • In the page source of each product there is an element matched by a CSS selector like id=pdpr-propstore*, which contains most/all of the product info we wanted as clean JSON (actually much more than that: nutritional info, price per liter/kg, allergens, etc.).
  • If we are fine with raw prices like '980' or '1199' (integer cents representing 9.80€ and 11.99€) and post-process them in code, that element is all we need to scrape; see the sketch after this list.
  • We can also scrape the already formatted price with the CSS selector #pdpr-ProductIinformation plus the XPath expression //meso-data[@data-context="product-detail"], but this may hurt performance.
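
A minimal sketch of this extraction with lxml, assuming the element id really starts with pdpr-propstore and that its text is plain JSON; the exact layout of the embedded data may differ from what sel.py handles:

    import json
    from lxml import html

    def extract_product(page_source: str) -> dict:
        tree = html.fromstring(page_source)
        # Elements whose id starts with "pdpr-propstore" carry the product JSON.
        nodes = tree.xpath('//*[starts-with(@id, "pdpr-propstore")]')
        return json.loads(nodes[0].text_content())

    def cents_to_euros(raw_price) -> float:
        # Raw prices are integer cents: '980' -> 9.80, '1199' -> 11.99.
        return int(raw_price) / 100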

Selenium solution

  • Trying Selenium, the headless webdriver can thankfully access any page of the marketplace, avoiding both the 403s and the captcha.

  • Fetch the XML sitemaps containing all product URLs with Selenium and parse them into a collection (list/generator), stripping whatever is not inside the URL tags.

  • We end up with a list of about 70k product URLs. We spawn NUM_DRIVERS headless webdrivers (tests suggest 4 and 8 work decently), each of which fetches its share of HTML sources successively and writes to a .txt file with shell redirection; see the sketch after this list.

  • Selenium webdrivers are slow to boot up and slow to fetch the HTML, and they are not compatible with async. Overall this makes for a slow process: it took more than 4 hours to extract the product data with this approach.
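
A minimal sketch of the multi-driver approach, assuming Chrome/chromedriver; the chunking, options and per-driver output files here are illustrative rather than the exact code in sel.py (which uses shell redirection):

    from concurrent.futures import ThreadPoolExecutor
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    NUM_DRIVERS = 4  # 4 and 8 both worked decently in testing

    def scrape_chunk(urls, out_path):
        opts = Options()
        opts.add_argument("--headless=new")
        driver = webdriver.Chrome(options=opts)
        with open(out_path, "a", encoding="utf-8") as out:
            for url in urls:
                driver.get(url)                  # no 403, no captcha
                out.write(driver.page_source + "\n")
        driver.quit()

    def run(product_urls):
        # Split the ~70k URLs across NUM_DRIVERS headless drivers.
        chunks = [product_urls[i::NUM_DRIVERS] for i in range(NUM_DRIVERS)]
        with ThreadPoolExecutor(max_workers=NUM_DRIVERS) as pool:
            for i, chunk in enumerate(chunks):
                pool.submit(scrape_chunk, chunk, f"products_{i}.txt")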

Playwright

  • After some research we find that Playwright supports async and suspect it might offer better performance.
  • We port most of the logic from the Selenium scraper but try to optimize file I/O and leverage concurrency to improve performance.
  • We switch from writing to .txt to writing to .csv; a minimal sketch follows this list.
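
A minimal async sketch of this approach; the concurrency level, CSV columns and queue handling are illustrative, not the exact code in playwright_scraper.py:

    import asyncio
    import csv
    from playwright.async_api import async_playwright

    CONCURRENCY = 8

    async def worker(context, queue, writer):
        page = await context.new_page()
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            await page.goto(url, timeout=30_000)
            writer.writerow([url, await page.content()])
        await page.close()

    async def main(product_urls):
        queue = asyncio.Queue()
        for url in product_urls:
            queue.put_nowait(url)
        async with async_playwright() as pw:
            browser = await pw.chromium.launch(headless=True)
            context = await browser.new_context()
            with open("pw/full.csv", "w", newline="", encoding="utf-8") as f:
                writer = csv.writer(f)
                # CONCURRENCY pages share one browser and drain the URL queue.
                await asyncio.gather(*(worker(context, queue, writer)
                                       for _ in range(CONCURRENCY)))
            await browser.close()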

curl-impersonate + ripgrep + jq + xargs

  • Needs to run from WSL in Windows' filesystem hierarchy. That sounds bad, but it is the fastest solution, reaching upwards of 100 requests/second.
  • Only this specific curl client works; others return 403 or face the captcha. A rough pipeline sketch follows this list.
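
A rough sketch of that pipeline, assuming the curl_chrome116 wrapper shipped with curl-impersonate; the parallelism, regex and jq field names are placeholders, not the exact command used:

    mkdir -p pages
    # Fetch every product page in parallel with the impersonating curl build.
    xargs -P 16 -I {} sh -c \
      'curl_chrome116 -s "{}" -o "pages/$(basename {}).html"' < product_urls.txt
    # Pull the embedded JSON back out; the fields passed to jq are placeholders.
    rg --no-filename --no-line-number -oU '(?s)id="pdpr-propstore[^>]*>(\{.*?\})' \
      -r '$1' pages/*.html | jq -c '{id, name}' > products.jsonl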

Miscellaneous

  • Fine-tuning parameters like timeout, chunk_size and NUM_DRIVERS may net better performance, but that requires very thorough and broad profiling, which is time-consuming in itself.
  • The scraper would probably benefit from being ported to a faster language like Java or C#. It proved hard to average more than ~5 product details extracted and written to disk per second.
  • Some products are not currently in stock, but their details are still present in the HTML source, just not rendered. I have chosen to keep track of them too.

Languages

Python 85.1% · Dockerfile 11.2% · Shell 2.7% · Batchfile 1.0%