Trying scrapy-playwright on Zyte Scrapy Cloud.
A custom Docker image is provided in order to install the system dependencies needed for the headless browsers.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# Paths to the browser executables inside the Docker image
_browsers = {
    "chromium": "/ms-playwright/chromium/chrome-linux/chrome",
    "firefox": "/ms-playwright/firefox/firefox/firefox",
    "webkit": "/ms-playwright/webkit/pw_run.sh",
}

PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "executablePath": _browsers[PLAYWRIGHT_BROWSER_TYPE],
    "timeout": 10000,  # browser launch timeout, in milliseconds
}
- TWISTED_REACTOR: scrapy-playwright will only function with the asyncio-based Twisted reactor.
- DOWNLOAD_HANDLERS: tells Scrapy to use the library's download handler to process requests.
- PLAYWRIGHT_LAUNCH_OPTIONS: the Docker image will be executed by a non-root user, and hence the path to the browser executable needs to be set explicitly.
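A mistyped browser type or a missing executable would otherwise only surface when the browser fails to launch, so it can be worth checking the path up front. The following is a minimal sketch and not part of the project: `browser_executable` is a hypothetical helper, and the paths are simply the ones listed in the settings above, which are only expected to exist inside the Docker image.

```python
import os

# Paths copied from the settings above; assumed valid only inside the Docker image.
BROWSERS = {
    "chromium": "/ms-playwright/chromium/chrome-linux/chrome",
    "firefox": "/ms-playwright/firefox/firefox/firefox",
    "webkit": "/ms-playwright/webkit/pw_run.sh",
}


def browser_executable(browser_type, browsers=BROWSERS, check_exists=False):
    """Return the executable path for browser_type (hypothetical helper).

    Raises ValueError for an unknown browser type and, when check_exists
    is true, FileNotFoundError if the path is not a file on disk.
    """
    try:
        path = browsers[browser_type]
    except KeyError:
        raise ValueError(f"unsupported browser type: {browser_type!r}") from None
    if check_exists and not os.path.isfile(path):
        raise FileNotFoundError(path)
    return path


if __name__ == "__main__":
    # Lookup only; pass check_exists=True when running inside the image.
    print(browser_executable("chromium"))
```

In a settings module, the result of `browser_executable(PLAYWRIGHT_BROWSER_TYPE, check_exists=True)` could take the place of the direct dictionary lookup, so that a bad path fails at startup rather than on the first request.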
- Make sure you have shub installed
- Replace the project id (project: <project-id>) in the scrapinghub.yml file with your own project id
- Run shub image upload
- Run shub schedule headers
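The last two steps can be scripted, for instance to run them from CI. This is a rough sketch under assumptions: `deploy_commands` and `deploy` are hypothetical helpers, the spider name `headers` is taken from the step above, and the script defaults to a dry run that only prints the commands.

```python
import shutil
import subprocess


def deploy_commands(spider="headers"):
    """Return the shub commands from the steps above, in order."""
    return [
        ["shub", "image", "upload"],
        ["shub", "schedule", spider],
    ]


def deploy(spider="headers", dry_run=True):
    """Print each command and, unless dry_run is true, execute it."""
    if not dry_run and shutil.which("shub") is None:
        raise RuntimeError("shub not found on PATH; install it first")
    for command in deploy_commands(spider):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing command
            subprocess.run(command, check=True)


if __name__ == "__main__":
    deploy(dry_run=True)
```

Calling `deploy(dry_run=False)` would run the commands for real; it assumes the project id in scrapinghub.yml has already been replaced.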
For more information, check out the full documentation on how to build and deploy Docker images to Scrapy Cloud.