Scrapy plugin for Zyte API.
- Python 3.7+
- Scrapy 2.0.1+
pip install scrapy-zyte-api
To enable this plugin:
- Set the http and https keys in the DOWNLOAD_HANDLERS Scrapy setting to "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler".
- Add "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware" to the DOWNLOADER_MIDDLEWARES Scrapy setting with any value, e.g. 1000.
- Set the REQUEST_FINGERPRINTER_CLASS Scrapy setting to "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter".
- Set the TWISTED_REACTOR Scrapy setting to "twisted.internet.asyncioreactor.AsyncioSelectorReactor".
- Set your Zyte API key as either the ZYTE_API_KEY Scrapy setting or as an environment variable of the same name.
For example, in the settings.py file of your Scrapy project:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"
The ZYTE_API_ENABLED setting, which is True by default, can be set to False to disable this plugin.
You can send requests through Zyte API in one of the following ways:
- Send all requests through Zyte API by default, letting Zyte API parameters be chosen automatically based on your Scrapy request parameters. See Using transparent mode below.
- Send specific requests through Zyte API, setting all Zyte API parameters manually, keeping full control of what is sent to Zyte API. See Sending requests with manually-defined parameters below.
- Send specific requests through Zyte API, letting Zyte API parameters be chosen automatically based on your Scrapy request parameters. See Sending requests with automatically-mapped parameters below.
Zyte API response parameters are mapped into Scrapy response parameters where possible. See Response mapping below for details.
Set the ZYTE_API_TRANSPARENT_MODE Scrapy setting to True to handle Scrapy requests as follows:
- By default, requests are sent through Zyte API with automatically-mapped parameters. See Sending requests with automatically-mapped parameters below for details about automatic request parameter mapping.
  You do not need to set the zyte_api_automap request meta key to True, but you can set it to a dictionary to extend your Zyte API request parameters.
- Requests with the zyte_api request meta key set to a dict are sent through Zyte API with manually-defined parameters. See Sending requests with manually-defined parameters below.
- Requests with the zyte_api_automap request meta key set to False are not sent through Zyte API.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        "ZYTE_API_TRANSPARENT_MODE": True,
    }

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"
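The routing rules for request meta keys can be restated as a small decision function. This is only a sketch of the documented behavior, not the plugin's implementation: the function name routing_mode and its return labels are hypothetical, and giving zyte_api priority when both meta keys are set is an assumption.

```python
def routing_mode(meta, transparent_mode=False):
    """Return how a request would be handled: "manual", "automap" or "none".

    Hypothetical helper that restates the documented rules.
    """
    # A dict in the zyte_api meta key means manually-defined parameters.
    # An empty dict still counts (only default parameters are used).
    # Assumption: zyte_api wins if zyte_api_automap is also set.
    if isinstance(meta.get("zyte_api"), dict):
        return "manual"

    # In transparent mode, automatic mapping is the default for every request.
    automap = meta.get("zyte_api_automap", transparent_mode)
    if automap is True or isinstance(automap, dict):
        return "automap"

    # zyte_api_automap=False, or no relevant key outside transparent mode.
    return "none"


print(routing_mode({}, transparent_mode=True))
print(routing_mode({"zyte_api_automap": False}, transparent_mode=True))
print(routing_mode({"zyte_api": {"browserHtml": True}}, transparent_mode=True))
```

With transparent mode enabled, the three calls cover the three cases listed above: automatic mapping by default, opting out with zyte_api_automap set to False, and manual parameters through zyte_api.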
To send a Scrapy request through Zyte API with manually-defined parameters, define your Zyte API parameters in the zyte_api key in Request.meta as a dict.
The only exception is the url parameter, which should not be defined as a Zyte API parameter. The value from Request.url is used automatically.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api": {
                    "browserHtml": True,
                }
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"
Note that response headers are necessary for raw response decoding. When defining parameters manually and requesting httpResponseBody extraction, remember to also request httpResponseHeaders extraction:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api": {
                    "httpResponseBody": True,
                    "httpResponseHeaders": True,
                }
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"
To learn more about Zyte API parameters, see the data extraction usage and API reference pages of the Zyte API documentation.
To send a Scrapy request through Zyte API letting Zyte API parameters be automatically chosen based on the parameters of that Scrapy request, set the zyte_api_automap key in Request.meta to True.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api_automap": True,
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"
See also Using transparent mode above and Automated request parameter mapping below.
Zyte API responses are mapped with one of the following classes:
- scrapy_zyte_api.responses.ZyteAPITextResponse, a subclass of scrapy.http.TextResponse, is used to map text responses, i.e. responses with browserHtml or responses with both httpResponseBody and httpResponseHeaders with a text body (e.g. plain text, HTML, JSON).
- scrapy_zyte_api.responses.ZyteAPIResponse, a subclass of scrapy.http.Response, is used to map any other response.
Zyte API response parameters are mapped into response class attributes where possible:
- url becomes response.url.
- statusCode becomes response.status.
- httpResponseHeaders and experimental.responseCookies become response.headers.
- experimental.responseCookies is also mapped into the request cookiejar.
- browserHtml and httpResponseBody are mapped into both response.text (str) and response.body (bytes).
  If none of these parameters were present, e.g. if the only requested output was screenshot, response.text and response.body would be empty.
  If a future version of Zyte API supported requesting both outputs on the same request, and both parameters were present, browserHtml would be the one mapped into response.text and response.body.
Both response classes have a raw_api_response attribute that contains a dict with the complete, raw response from Zyte API, where you can find all Zyte API response parameters, including those that are not mapped into other response class attributes.
For example, for a request for httpResponseBody and httpResponseHeaders, you would get:
def parse(self, response):
    print(response.url)
    # "https://quotes.toscrape.com/"
    print(response.status)
    # 200
    print(response.headers)
    # {b"Content-Type": [b"text/html"], …}
    print(response.text)
    # "<html>…</html>"
    print(response.body)
    # b"<html>…</html>"
    print(response.raw_api_response)
    # {
    #     "url": "https://quotes.toscrape.com/",
    #     "statusCode": 200,
    #     "httpResponseBody": "PGh0bWw+4oCmPC9odG1sPg==",
    #     "httpResponseHeaders": […],
    # }
For a request for screenshot, on the other hand, the response would look as follows:
from base64 import b64decode


def parse(self, response):
    print(response.url)
    # "https://quotes.toscrape.com/"
    print(response.status)
    # 200
    print(response.headers)
    # {}
    print(response.text)
    # ""
    print(response.body)
    # b""
    print(response.raw_api_response)
    # {
    #     "url": "https://quotes.toscrape.com/",
    #     "statusCode": 200,
    #     "screenshot": "iVBORw0KGgoAAAANSUh…",
    # }
    print(b64decode(response.raw_api_response["screenshot"]))
    # b'\x89PNG\r\n\x1a\n\x00\x00\x00\r…'
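To illustrate why response headers matter for text decoding, here is a minimal standard-library sketch that decodes a raw httpResponseBody by hand. It is not the plugin's decoding logic. The helper name decode_text_body is hypothetical, the header value is made up for illustration (the example above elides the headers), and the name/value list shape of httpResponseHeaders is assumed to match the request header format shown elsewhere in this document.

```python
from base64 import b64decode
from email.message import Message


def decode_text_body(raw_api_response, default_encoding="utf-8"):
    """Decode httpResponseBody to str, using the Content-Type charset if any.

    Hypothetical helper; the plugin's own decoding may be more thorough.
    """
    body = b64decode(raw_api_response["httpResponseBody"])
    charset = None
    for header in raw_api_response.get("httpResponseHeaders", []):
        if header["name"].lower() == "content-type":
            # Let the standard library parse the header parameters.
            message = Message()
            message["Content-Type"] = header["value"]
            charset = message.get_content_charset()
    return body.decode(charset or default_encoding)


raw_api_response = {
    "url": "https://quotes.toscrape.com/",
    "statusCode": 200,
    "httpResponseBody": "PGh0bWw+4oCmPC9odG1sPg==",
    # Assumed header value, for illustration only.
    "httpResponseHeaders": [
        {"name": "Content-Type", "value": "text/html; charset=utf-8"},
    ],
}
print(decode_text_body(raw_api_response))
```

Without the headers, the sketch can only guess the encoding through default_encoding, which is why manual httpResponseBody requests should also ask for httpResponseHeaders.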
When you enable automated request parameter mapping, be it through transparent mode (see Using transparent mode above) or for a specific request (see Sending requests with automatically-mapped parameters above), Zyte API parameters are chosen as follows by default:
- Request.url becomes url, same as in requests with manually-defined parameters.
- If Request.method is something other than "GET", it becomes httpRequestMethod.
- Request.headers become customHttpRequestHeaders.
- Request.body becomes httpRequestBody.
- If the ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED Scrapy setting is True, the COOKIES_ENABLED Scrapy setting is True (default), and provided request metadata does not set dont_merge_cookies to True:
  - experimental.responseCookies is set to True.
  - Cookies from the request cookie jar become experimental.requestCookies.
    All cookies from the cookie jar are set, regardless of their cookie domain. This is because Zyte API requests may involve requests to different domains (e.g. when following cross-domain redirects, or during browser rendering).
    If the cookies to be set exceed the limit defined in the ZYTE_API_MAX_COOKIES setting (100 by default), a warning is logged, and only as many cookies as the limit allows are set for the target request. To silence this warning, set experimental.requestCookies manually, e.g. to an empty dict. Alternatively, if Zyte API starts supporting more than 100 request cookies, update the ZYTE_API_MAX_COOKIES setting accordingly.
    If you are using a custom downloader middleware to handle request cookiejars, you can point the ZYTE_API_COOKIE_MIDDLEWARE setting to its import path to make scrapy-zyte-api work with it. The downloader middleware is expected to have a jars property with the same signature as in the built-in Scrapy downloader middleware for cookie handling.
- httpResponseBody and httpResponseHeaders are set to True.
  This is subject to change without prior notice in future versions of scrapy-zyte-api, so please account for the following:
  - If you are requesting a binary resource, such as a PDF file or an image file, set httpResponseBody to True explicitly in your requests:
        Request(
            url="https://toscrape.com/img/zyte.png",
            meta={
                "zyte_api_automap": {"httpResponseBody": True},
            },
        )
    In the future, we may stop setting httpResponseBody to True by default, and instead use a different, new Zyte API parameter that only works for non-binary responses (e.g. HTML, JSON, plain text).
  - If you need to access response headers, be it through response.headers or through response.raw_api_response["httpResponseHeaders"], set httpResponseHeaders to True explicitly in your requests:
        Request(
            url="https://toscrape.com/",
            meta={
                "zyte_api_automap": {"httpResponseHeaders": True},
            },
        )
    At the moment we request response headers because some response headers are necessary to properly decode the response body as text. In the future, Zyte API may be able to handle this decoding automatically, so we would stop setting httpResponseHeaders to True by default.
For example, the following Scrapy request:
Request(
    method="POST",
    url="https://httpbin.org/anything",
    headers={"Content-Type": "application/json"},
    body=b'{"foo": "bar"}',
    cookies={"a": "b"},
)
Results in a request to the Zyte API data extraction endpoint with the following parameters:
{
    "customHttpRequestHeaders": [
        {
            "name": "Content-Type",
            "value": "application/json"
        }
    ],
    "experimental": {
        "requestCookies": [
            {
                "name": "a",
                "value": "b",
                "domain": ""
            }
        ],
        "responseCookies": true
    },
    "httpResponseBody": true,
    "httpResponseHeaders": true,
    "httpRequestBody": "eyJmb28iOiAiYmFyIn0=",
    "httpRequestMethod": "POST",
    "url": "https://httpbin.org/anything"
}
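The default mapping rules can be restated as plain Python that builds the same parameters from the request fields in this example. This is an illustrative sketch, not the plugin's code. The function automap and its arguments are hypothetical, header filtering is left out, and keeping the first max_cookies cookies when the limit is exceeded is an assumption (the text only says that as many cookies as the limit allows are set).

```python
import logging
from base64 import b64encode

logger = logging.getLogger(__name__)


def automap(url, method="GET", headers=None, body=b"", cookies=None,
            experimental_cookies=False, max_cookies=100):
    """Build Zyte API parameters from plain request fields (hypothetical helper)."""
    params = {"url": url, "httpResponseBody": True, "httpResponseHeaders": True}
    if method != "GET":
        params["httpRequestMethod"] = method
    if headers:
        # Header filtering (e.g. skipping User-Agent) is omitted in this sketch.
        params["customHttpRequestHeaders"] = [
            {"name": name, "value": value} for name, value in headers.items()
        ]
    if body:
        # Request bodies travel as Base64 text.
        params["httpRequestBody"] = b64encode(body).decode("ascii")
    if experimental_cookies:
        experimental = {"responseCookies": True}
        request_cookies = [
            {"name": name, "value": value, "domain": ""}
            for name, value in (cookies or {}).items()
        ]
        if len(request_cookies) > max_cookies:
            logger.warning("Too many cookies; only %d are sent", max_cookies)
            # Assumption: the first max_cookies cookies are the ones kept.
            request_cookies = request_cookies[:max_cookies]
        if request_cookies:
            experimental["requestCookies"] = request_cookies
        params["experimental"] = experimental
    return params


params = automap(
    url="https://httpbin.org/anything",
    method="POST",
    headers={"Content-Type": "application/json"},
    body=b'{"foo": "bar"}',
    cookies={"a": "b"},
    experimental_cookies=True,
)
print(params["httpRequestBody"])
# eyJmb28iOiAiYmFyIn0=
```

The cookie branch only runs when experimental_cookies is True, mirroring the condition on ZYTE_API_EXPERIMENTAL_COOKIES_ENABLED described above.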
You may set the zyte_api_automap key in Request.meta to a dict of Zyte API parameters to extend or override choices made by automated request parameter mapping.
Setting browserHtml or screenshot to True unsets httpResponseBody and httpResponseHeaders, and makes Request.headers become requestHeaders instead of customHttpRequestHeaders. For example, the following Scrapy request:
Request(
    url="https://quotes.toscrape.com",
    headers={"Referer": "https://example.com/"},
    meta={"zyte_api_automap": {"browserHtml": True}},
)
Results in a request to the Zyte API data extraction endpoint with the following parameters:
{
    "browserHtml": true,
    "experimental": {
        "responseCookies": true
    },
    "requestHeaders": {"referer": "https://example.com/"},
    "url": "https://quotes.toscrape.com"
}
When mapping headers, headers not supported by Zyte API are excluded from the mapping by default. Use the following Scrapy settings to change which headers are included or excluded from header mapping:
- ZYTE_API_SKIP_HEADERS determines headers that must not be mapped as customHttpRequestHeaders, and its default value is: ["User-Agent"]
- ZYTE_API_BROWSER_HEADERS determines headers that can be mapped as requestHeaders. It is a dict, where keys are header names and values are the key that represents them in requestHeaders. Its default value is: {"Referer": "referer"}
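As a rough sketch of how these two settings shape header mapping, the functions below apply a skip list and a browser-header dict to a plain dict of headers. The function names are hypothetical, and matching header names case-insensitively is an assumption, not documented behavior.

```python
def map_custom_headers(headers, skip_headers=("User-Agent",)):
    """Map headers to the customHttpRequestHeaders list format, minus skipped ones."""
    skipped = {name.lower() for name in skip_headers}
    return [
        {"name": name, "value": value}
        for name, value in headers.items()
        if name.lower() not in skipped  # assumption: case-insensitive match
    ]


def map_browser_headers(headers, browser_headers=None):
    """Map headers to the requestHeaders dict format, keeping only allowed ones."""
    if browser_headers is None:
        browser_headers = {"Referer": "referer"}
    lookup = {name.lower(): key for name, key in browser_headers.items()}
    return {
        lookup[name.lower()]: value
        for name, value in headers.items()
        if name.lower() in lookup
    }


headers = {
    "User-Agent": "my-bot/1.0",
    "Referer": "https://example.com/",
    "Content-Type": "application/json",
}
print(map_custom_headers(headers))
print(map_browser_headers(headers))
```

The first call drops User-Agent and keeps the other two headers; the second keeps only Referer, under the referer key.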
To maximize support for potential future changes in Zyte API, automated request parameter mapping allows some parameter values and parameter combinations that Zyte API does not currently support, and may never support:
- Request.method becomes httpRequestMethod even for unsupported httpRequestMethod values, and even if httpResponseBody is unset.
- You can set customHttpRequestHeaders or requestHeaders to True to force their mapping from Request.headers in scenarios where they would not be mapped otherwise.
  Conversely, you can set customHttpRequestHeaders or requestHeaders to False to prevent their mapping from Request.headers.
- Request.body becomes httpRequestBody even if httpResponseBody is unset.
- You can set httpResponseBody to False (which unsets the parameter), and not set browserHtml or screenshot to True. In this case, Request.headers is mapped as requestHeaders.
- You can set httpResponseBody to True and also set browserHtml or screenshot to True. In this case, Request.headers is mapped both as customHttpRequestHeaders and as requestHeaders, and browserHtml is used as the Scrapy response body.
Often the same configuration needs to be used for all Zyte API requests. For example, all requests may need to set the same geolocation, or the spider only uses browserHtml requests.
The following settings allow you to define Zyte API parameters to be included in all requests:
- ZYTE_API_DEFAULT_PARAMS is a dict of parameters to be combined with manually-defined parameters. See Sending requests with manually-defined parameters above.
  You may set the zyte_api request meta key to an empty dict to only use default parameters for that request.
- ZYTE_API_AUTOMAP_PARAMS is a dict of parameters to be combined with automatically-mapped parameters. See Sending requests with automatically-mapped parameters above.
For example, if you set ZYTE_API_DEFAULT_PARAMS to {"geolocation": "US"} and zyte_api to {"browserHtml": True}, {"url": "…", "geolocation": "US", "browserHtml": True} is sent to Zyte API.
Parameters in these settings are merged with request-specific parameters, with request-specific parameters taking precedence.
ZYTE_API_DEFAULT_PARAMS has no effect on requests that use automated request parameter mapping, and ZYTE_API_AUTOMAP_PARAMS has no effect on requests that use manually-defined parameters.
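The precedence rule can be pictured as a plain dict merge in which request-specific keys overwrite keys from the setting. The helper merge_params below is hypothetical and only illustrates that rule; the plugin may handle further cases that are not shown here.

```python
def merge_params(setting_params, request_params):
    """Merge setting-level parameters with request-level ones.

    Later keys win, so request-specific parameters take precedence.
    """
    return {**setting_params, **request_params}


ZYTE_API_DEFAULT_PARAMS = {"geolocation": "US"}

# The request adds a parameter: both end up in the result.
print(merge_params(ZYTE_API_DEFAULT_PARAMS, {"browserHtml": True}))

# The request overrides a default: the request value is kept.
print(merge_params(ZYTE_API_DEFAULT_PARAMS, {"geolocation": "FR"}))

# An empty dict in the request means only the defaults are used.
print(merge_params(ZYTE_API_DEFAULT_PARAMS, {}))
```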
When using transparent mode (see Using transparent mode above), be careful of which parameters you define through ZYTE_API_AUTOMAP_PARAMS. In transparent mode, all Scrapy requests go through Zyte API, even requests that Scrapy sends automatically, such as those for robots.txt files when ROBOTSTXT_OBEY is True, or those for sitemaps when using a sitemap spider. Certain parameters, like browserHtml or screenshot, are not meant to be used for every single request.
API requests are retried automatically using the default retry policy of python-zyte-api.
API requests that exceed retries are dropped. You cannot manage API request retries through Scrapy downloader middlewares.
Use the ZYTE_API_RETRY_POLICY setting or the zyte_api_retry_policy request meta key to override the default python-zyte-api retry policy with a custom retry policy.
A custom retry policy must be an instance of tenacity.AsyncRetrying.
Scrapy settings must be picklable, which retry policies are not, so you cannot assign retry policy objects directly to the ZYTE_API_RETRY_POLICY setting, and must use their import path string instead.
When setting a retry policy through request meta, you can assign the zyte_api_retry_policy request meta key either the retry policy object itself or its import path string. If you need your requests to be serializable, however, you may also need to use the import path string.
For example, to also retry HTTP 521 errors the same as HTTP 520 errors, you can subclass RetryFactory as follows:
# project/retry_policies.py
from tenacity import retry_if_exception, RetryCallState
from zyte_api.aio.errors import RequestError
from zyte_api.aio.retry import RetryFactory


def is_http_521(exc: BaseException) -> bool:
    return isinstance(exc, RequestError) and exc.status == 521


class CustomRetryFactory(RetryFactory):
    retry_condition = (
        RetryFactory.retry_condition
        | retry_if_exception(is_http_521)
    )

    def wait(self, retry_state: RetryCallState) -> float:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_wait(retry_state=retry_state)
        return super().wait(retry_state)

    def stop(self, retry_state: RetryCallState) -> bool:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_stop(retry_state)
        return super().stop(retry_state)


CUSTOM_RETRY_POLICY = CustomRetryFactory().build()

# project/settings.py
ZYTE_API_RETRY_POLICY = "project.retry_policies.CUSTOM_RETRY_POLICY"
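An import path string such as the one above stays picklable because it is plain text; the object it names is only looked up when needed. The sketch below shows one generic way to resolve such a string with the standard library. The function resolve_import_path is hypothetical and is not how the plugin is known to do it (Scrapy ships its own helper, scrapy.utils.misc.load_object, for this kind of lookup). A standard-library name stands in for the retry policy so the example runs on its own.

```python
import pickle
from importlib import import_module


def resolve_import_path(path):
    """Return the object named by a dotted "module.attribute" path."""
    module_path, _, attribute = path.rpartition(".")
    return getattr(import_module(module_path), attribute)


# A string setting value pickles without trouble.
setting_value = "project.retry_policies.CUSTOM_RETRY_POLICY"
assert pickle.loads(pickle.dumps(setting_value)) == setting_value

# Resolving works the same way for any importable name; base64.b64decode is
# used here only because the project module above does not exist in this sketch.
decode = resolve_import_path("base64.b64decode")
print(decode("aGVsbG8="))
# b'hello'
```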
Stats from python-zyte-api are exposed as Scrapy stats with the scrapy-zyte-api prefix.
The request fingerprinter class of this plugin ensures that Scrapy 2.7 and later generate unique request fingerprints for Zyte API requests based on some of their parameters.
For example, a request for browserHtml and a request for screenshot with the same target URL are considered different requests. Similarly, requests with the same target URL but different actions are also considered different requests.
The request fingerprinter class of this plugin generates request fingerprints for Zyte API requests based on the following Zyte API parameters:
- url (canonicalized)
  For URLs that include a URL fragment, like https://example.com#foo, URL canonicalization keeps the URL fragment if browserHtml or screenshot are enabled.
- Request attribute parameters (httpRequestBody, httpRequestMethod)
- Output parameters (browserHtml, httpResponseBody, httpResponseHeaders, screenshot)
- Rendering option parameters (actions, javascript, screenshotOptions)
- geolocation
The following Zyte API parameters are not taken into account for request fingerprinting:
- Request header parameters (customHttpRequestHeaders, requestHeaders)
- Metadata parameters (echoData, jobId)
- Experimental parameters (experimental)
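The effect of including some parameters and ignoring others can be illustrated with a small hashing sketch. This is not the plugin's fingerprinting algorithm: the function name, the use of SHA-1 over sorted JSON, and the simplified URL handling (only the fragment rule, no real canonicalization) are all assumptions made for illustration.

```python
import hashlib
import json
from urllib.parse import urldefrag

# Parameters listed above as affecting the fingerprint (besides url).
FINGERPRINT_KEYS = (
    "httpRequestBody", "httpRequestMethod",
    "browserHtml", "httpResponseBody", "httpResponseHeaders", "screenshot",
    "actions", "javascript", "screenshotOptions",
    "geolocation",
)


def fingerprint(api_params):
    """Hash only the parameters that should distinguish two requests."""
    url = api_params["url"]
    # Keep the URL fragment only for browser-based outputs.
    if not (api_params.get("browserHtml") or api_params.get("screenshot")):
        url = urldefrag(url).url
    relevant = {"url": url}
    for key in FINGERPRINT_KEYS:
        if key in api_params:
            relevant[key] = api_params[key]
    # Headers, metadata and experimental parameters are never copied over.
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


a = fingerprint({"url": "https://example.com", "browserHtml": True})
b = fingerprint({"url": "https://example.com", "screenshot": True})
c = fingerprint({"url": "https://example.com", "browserHtml": True, "echoData": "x"})
print(a != b)  # different outputs give different fingerprints
print(a == c)  # metadata does not change the fingerprint
```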
You can assign a request fingerprinter class to the ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS Scrapy setting to configure a custom request fingerprinter class to use for requests that do not go through Zyte API:
ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "custom.RequestFingerprinter"
By default, requests that do not go through Zyte API use the default request fingerprinter class of the installed Scrapy version.
If you have a Scrapy version older than Scrapy 2.7, Zyte API parameters are not taken into account for request fingerprinting. This can cause some Scrapy components, like the filter of duplicate requests or the HTTP cache extension, to interpret 2 different requests as being the same.
To avoid most issues, use automated request parameter mapping, either through transparent mode or setting zyte_api_automap to True in Request.meta, and then use Request attributes instead of Request.meta as much as possible. Unlike Request.meta, Request attributes do affect request fingerprints in Scrapy versions older than Scrapy 2.7.
For requests that must have the same Request attributes but should still be considered different, such as browser-based requests with different URL fragments, you can pass dont_filter=True to Request to prevent the Scrapy duplicate filter from filtering any of them out. For example:
yield Request(
    "https://toscrape.com#1",
    meta={"zyte_api_automap": {"browserHtml": True}},
    dont_filter=True,
)
yield Request(
    "https://toscrape.com#2",
    meta={"zyte_api_automap": {"browserHtml": True}},
    dont_filter=True,
)
Note, however, that for other Scrapy components, like the HTTP cache extensions, these 2 requests would still be considered identical.
Set the ZYTE_API_LOG_REQUESTS setting to True to enable the logging of debug messages that indicate the JSON object sent on every extract request to Zyte API.
For example:
Sending Zyte API extract request: {"url": "https://example.com", "httpResponseBody": true}
The ZYTE_API_LOG_REQUESTS_TRUNCATE setting, 64 by default, determines the maximum length of any string value in the logged JSON object, excluding object keys. To disable truncation, set it to 0.
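To picture what truncating string values but not keys means, here is a minimal recursive sketch. The function truncate_strings is hypothetical. A plain cut at the maximum length is an assumption, and the plugin may mark truncated values differently.

```python
import json


def truncate_strings(value, max_length=64):
    """Return a copy of a JSON-like value with long string values cut short.

    Object keys are left untouched; max_length=0 disables truncation.
    """
    if max_length <= 0:
        return value
    if isinstance(value, str):
        return value[:max_length]  # assumption: plain cut, no marker added
    if isinstance(value, dict):
        return {key: truncate_strings(item, max_length) for key, item in value.items()}
    if isinstance(value, list):
        return [truncate_strings(item, max_length) for item in value]
    return value


api_params = {
    "url": "https://example.com",
    "httpResponseBody": True,
    "httpRequestBody": "A" * 200,  # e.g. a long Base64-encoded body
}
print("Sending Zyte API extract request:", json.dumps(truncate_strings(api_params)))
```

Short values such as the URL and non-string values such as True pass through unchanged; only the 200-character body is shortened.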