dahlia / wikidata

Wikidata client library for Python

Home Page: https://pypi.org/project/Wikidata/

Empty response (JSONDecodeError) when sending many requests in a row

marccarre opened this issue

Version

0.7.0

Problem

When sending several (10~100) requests in a row, some requests fail non-deterministically with the following error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
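
For reference, a loop along these lines (a minimal sketch; the entity ID and iteration count are arbitrary) reproduces it intermittently:

from wikidata.client import Client

for i in range(100):
    client = Client()                          # fresh client, so each get() actually hits the API
    entity = client.get('Q20145', load=True)   # some iterations raise the JSONDecodeError above
    print(i, entity.label)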

Upon closer investigation, the actual response is a 429,

  • with "reason": Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy, and
  • with the following "body":
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
	* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { overflow: hidden; overflow-wrap: break-word; word-wrap: break-word; -webkit-hyphens: auto; -moz-hyphens: auto; -ms-hyphens: auto; hyphens: auto; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
	<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf-logo.png" srcset="https://www.wikimedia.org/static/images/wmf-logo-2x.png 2x" alt="Wikimedia" width="135" height="101">
</a>
<h1>Error</h1>
<div class="content-text">
	<p>Our servers are currently under maintenance or experiencing a technical problem.

	Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 122.216.10.145 via cp5012 cp5012, Varnish XID 477962109<br>Upstream caches: cp5012 int<br>Error: 429, Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy at Sun, 17 Jul 2022 22:28:20 GMT</code></p>
</div>
</html>
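
This also explains the JSONDecodeError itself: the body above is HTML, not JSON, so decoding it fails at the very first character:

>>> import json
>>> json.loads('<!DOCTYPE html>\n<html lang="en">\n...')
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)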

Root cause

This library doesn't follow Wikimedia's user-agent policy, specifically:

<client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]. Parts that are not applicable can be omitted.

which leads to temporary rate limiting/blacklisting of the agent:

Requests from disallowed user agents may instead encounter a less helpful error message like this:
Our servers are currently experiencing a technical problem. Please try again in a few minutes.

See also: https://meta.wikimedia.org/wiki/User-Agent_policy

Solution

Set a User-Agent header compliant with the above policy, e.g.:

>>> import urllib
>>> od = urllib.request.OpenerDirector()
>>> od.addheaders 
[('User-agent', 'Python-urllib/3.9')]
>>> 
>>> import wikidata
>>> wikidata.__version__
'0.7.0'
>>> 
>>> import sys
>>> od.addheaders = [  # addheaders must remain a list of (name, value) tuples, not a dict
...     ("Accept", "application/sparql-results+json"),
...     ("User-Agent", "wikidata-based-bot/%s (https://github.com/dahlia/wikidata ; hong.minhee@gmail.com) python/%s.%s.%s Wikidata/%s" % (wikidata.__version__, sys.version_info.major, sys.version_info.minor, sys.version_info.micro, wikidata.__version__)),
... ]
>>> 
>>> od.addheaders 
[('Accept', 'application/sparql-results+json'), ('User-Agent', 'wikidata-based-bot/0.7.0 (https://github.com/dahlia/wikidata ; hong.minhee@gmail.com) python/3.9.13 Wikidata/0.7.0')]
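
To illustrate what the fix amounts to at the HTTP level, a plain urllib request that identifies itself per the policy can be sent to the entity-data endpoint; the bot name, URL, and contact address below are placeholders, and the entity ID is just an example:

import json
import urllib.request

# Placeholder bot identity: replace the name, URL, and e-mail address with your own.
USER_AGENT = (
    "my-wikidata-bot/0.1 "
    "(https://example.org/my-bot; bot-admin@example.org) "
    "python-urllib/3.9"
)

req = urllib.request.Request(
    "https://www.wikidata.org/wiki/Special:EntityData/Q20145.json",
    headers={"User-Agent": USER_AGENT},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)   # parses fine: the body is JSON, not the HTML error page
print(data["entities"]["Q20145"]["labels"]["en"]["value"])

The library-side fix is then the same idea applied to the opener it uses for every API/SPARQL request, as in the OpenerDirector snippet above.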

By the way, doesn't Wikidata provide rate-limit header fields? If it does, we could intelligently control the request rate from the client side.

I looked for this too but, unless I missed something, couldn't find such a header in the response:

{'accept-ch': 'Sec-CH-UA-Arch,Sec-CH-UA-Bitness,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-UA-Platform-Version',
 'accept-ranges': 'bytes',
 'access-control-allow-origin': '*',
 'age': '1',
 'cache-control': 'public, max-age=300',
 'content-encoding': 'gzip',
 'content-type': 'application/sparql-results+json;charset=utf-8',
 'date': 'Fri, 22 Jul 2022 09:33:06 GMT',
 'nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, '
        '"success_fraction": 0.0}',
 'permissions-policy': 'interest-cohort=(),ch-ua-arch=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-bitness=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-full-version-list=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-model=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-platform-version=(self '
                       '"intake-analytics.wikimedia.org")',
 'report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": '
              '"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" '
              '}] }',
 'server': 'nginx/1.14.2',
 'server-timing': 'cache;desc="pass", host;desc="cp5008"',
 'set-cookie': 'WMF-Last-Access=22-Jul-2022;Path=/;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT, '
               'WMF-Last-Access-Global=22-Jul-2022;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT',
 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload',
 'transfer-encoding': 'chunked',
 'vary': 'Accept, Accept-Encoding',
 'x-cache': 'cp5009 miss, cp5008 pass',
 'x-cache-status': 'pass',
 'x-client-ip': '***.***.***.***',
 'x-first-solution-millis': '48',
 'x-served-by': 'wdqs2001'}

There seems to be only this recommendation on the request rate:
https://wikitech.wikimedia.org/wiki/Robot_policy#Request_rate
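
Failing that, about the only client-side option is to pace requests (and back off after a 429); a minimal sketch, where the one-second spacing is an illustrative value rather than anything the policy or the API specifies:

import time

class Throttle:
    """Keep successive requests at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0):   # 1 s is an arbitrary, illustrative default
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        delay = self._last + self.min_interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()

throttle = Throttle()
# for entity_id in ids_to_fetch:
#     throttle.wait()
#     ... perform the request for entity_id ...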

Thank you for letting me know that! 🙏🏼