denny64 / scrapy-playwright-cloud-example

Trying scrapy-playwright on Scrapy Cloud

scrapy-playwright sample project for Scrapy Cloud

Trying scrapy-playwright on Zyte Scrapy Cloud.

Dockerfile

A custom Docker image is provided to install the system dependencies that the headless browsers need.

Settings

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

_browsers = {
    "chromium": "/ms-playwright/chromium/chrome-linux/chrome",
    "firefox": "/ms-playwright/firefox/firefox/firefox",
    "webkit": "/ms-playwright/webkit/pw_run.sh",
}
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "executablePath": _browsers[PLAYWRIGHT_BROWSER_TYPE],
    "timeout": 10000,
}
  • TWISTED_REACTOR: scrapy-playwright requires the asyncio-based Twisted reactor and does not work with the other reactors
  • DOWNLOAD_HANDLERS: tells Scrapy to route http and https requests through the library's download handler
  • PLAYWRIGHT_LAUNCH_OPTIONS: the Docker image runs as a non-root user, so the path to the browser executable must be set explicitly through executablePath. The timeout value is in milliseconds (10000 ms = 10 seconds).
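
The browser lookup in the settings above could also be wrapped in a small helper that fails early on an unknown browser type. This is only a sketch: the helper name build_launch_options and the SCRAPY_PLAYWRIGHT_BROWSER environment variable are assumptions made for illustration and are not part of this project or of scrapy-playwright. The paths are copied from the settings above and are specific to this project's Docker image.

```python
import os

# Paths copied from the settings above; they only exist inside this project's Docker image.
BROWSER_PATHS = {
    "chromium": "/ms-playwright/chromium/chrome-linux/chrome",
    "firefox": "/ms-playwright/firefox/firefox/firefox",
    "webkit": "/ms-playwright/webkit/pw_run.sh",
}


def build_launch_options(browser_type, timeout_ms=10000):
    """Hypothetical helper: map a browser type to its launch options."""
    try:
        executable = BROWSER_PATHS[browser_type]
    except KeyError:
        known = ", ".join(sorted(BROWSER_PATHS))
        raise ValueError(
            f"unknown browser type {browser_type!r}; expected one of: {known}"
        ) from None
    # Same keys as the original settings; the timeout is in milliseconds.
    return {"executablePath": executable, "timeout": timeout_ms}


# SCRAPY_PLAYWRIGHT_BROWSER is an assumed variable name, not defined by the project.
PLAYWRIGHT_BROWSER_TYPE = os.environ.get("SCRAPY_PLAYWRIGHT_BROWSER", "chromium")
PLAYWRIGHT_LAUNCH_OPTIONS = build_launch_options(PLAYWRIGHT_BROWSER_TYPE)
```

With this layout, a typo in the browser name raises a ValueError when the settings module is loaded rather than a KeyError with no hint about the valid choices.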

Build and deploy

  • Make sure you have shub installed
  • Replace the project id (project: <project-id>) in the scrapinghub.yml file with your own project id
  • Run shub image upload
  • Run shub schedule headers
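
The last two steps can also be scripted. The following is a minimal sketch that assumes shub is installed and on the PATH and that scrapinghub.yml already contains your project id. The function names deploy_commands and deploy are hypothetical; only the two shub commands come from the steps above.

```python
import subprocess


def deploy_commands(spider="headers"):
    """Return the shub commands from the steps above, in order."""
    return [
        ["shub", "image", "upload"],     # build and upload the Docker image
        ["shub", "schedule", spider],    # schedule the spider on Scrapy Cloud
    ]


def deploy(spider="headers", run=subprocess.run):
    """Run each command in turn, stopping at the first failure."""
    for command in deploy_commands(spider):
        # check=True raises CalledProcessError on a non-zero exit code.
        run(command, check=True)
```

Calling deploy() from the project root runs both commands. The run parameter exists only so the command sequence can be inspected without invoking shub.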

For more information, check out the full documentation on how to build and deploy Docker images to Scrapy Cloud.
