error while decompressing data: incorrect header check
azotlikid opened this issue
Hello,
I get errors with some URLs:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 262, in _decode
data = self._decoder.decompress(data)
File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 62, in decompress
return self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "reppy/robots.pyx", line 78, in reppy.robots.FetchMethod (reppy/robots.cpp:3235)
File "reppy/robots.pyx", line 79, in reppy.robots.FetchMethod (reppy/robots.cpp:2706)
File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 404, in read
data = self._decode(data, decode_content, flush_decoder)
File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 267, in _decode
"failed to decode it." % content_encoding, e)
requests.packages.urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check',))
Steps to reproduce:
from reppy.robots import Robots
robots = Robots.fetch('http://stackoverflow.com/robots.txt')
robots = Robots.fetch('http://askubuntu.com/robots.txt')
robots = Robots.fetch('http://superuser.com/robots.txt')
But fetching with urllib3 alone doesn't fail:
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET','http://superuser.com/robots.txt')
>>> r.status
200
>>> r.data
b"User-Agent: *\r\nDisallow: ....
We use requests, not urllib3, directly, and I can reproduce it with just requests:
>>> import requests
>>> r = requests.get('http://stackoverflow.com/robots.txt')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 608, in send
r.content
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 737, in content
self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 665, in generate
raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing: incorrect header check',))
So this appears to be an issue with requests, not reppy.
requests uses urllib3 (source), as explained in the README: "[requests] is powered by urllib3, which is embedded within Requests."
At first glance, it's an issue with requests and not with reppy.
Yes, I'm well aware that requests uses urllib3; I was just pointing out that this bug is somewhere outside of reppy.
I'm going to close this out as "Won't Fix" because we're not going to implement workarounds for servers sending bad data back. If the content is not gzipped, but the headers are reporting it is gzipped, that's an error on the server's part. Doing arbitrary content detection is notoriously tricky and potentially a security issue.
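For anyone who still needs to read these files, a possible workaround (a sketch only, assuming your reppy version exposes Robots.parse and that the server returns a sane body when compression isn't requested) is to fetch the text yourself and hand it to reppy:

import requests
from reppy.robots import Robots

url = 'http://stackoverflow.com/robots.txt'
# Ask for an uncompressed body so requests never runs the failing gzip
# decoder, then let reppy parse the raw text instead of fetching it itself.
response = requests.get(url, headers={'Accept-Encoding': 'identity'})
robots = Robots.parse(url, response.text)
print(robots.allowed('http://stackoverflow.com/questions', 'my-user-agent'))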