seomoz / reppy

Modern robots.txt Parser for Python

error while decompressing data: incorrect header check

azotlikid opened this issue

commented

Hello,

I get errors with some URLs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 262, in _decode
    data = self._decoder.decompress(data)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 62, in decompress
    return self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "reppy/robots.pyx", line 78, in reppy.robots.FetchMethod (reppy/robots.cpp:3235)
  File "reppy/robots.pyx", line 79, in reppy.robots.FetchMethod (reppy/robots.cpp:2706)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 404, in read
    data = self._decode(data, decode_content, flush_decoder)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 267, in _decode
    "failed to decode it." % content_encoding, e)
requests.packages.urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check',))
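
For context, this zlib error just means the decompressor was handed bytes that do not begin with a valid gzip/zlib header. You can trigger the same message by feeding plain text to zlib, a minimal illustration:

>>> import zlib
>>> zlib.decompress(b'User-Agent: *')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check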

Steps to reproduce:

from reppy.robots import Robots
robots = Robots.fetch('http://stackoverflow.com/robots.txt')
robots = Robots.fetch('http://askubuntu.com/robots.txt')
robots = Robots.fetch('http://superuser.com/robots.txt')

Fetching with urllib3 alone doesn't fail, though:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET','http://superuser.com/robots.txt')
>>> r.status
200
>>> r.data
b"User-Agent: *\r\nDisallow:     ....

commented

We use requests, not urllib3 directly, and I can reproduce it with just requests:

>>> import requests
>>> r = requests.get('http://stackoverflow.com/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 608, in send
    r.content
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 737, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 665, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check',))

So this appears to be an issue with requests, not reppy.

commented

requests uses urllib3 (source), as explained in the README: "[requests] is powered by urllib3, which is embedded within Requests."

At first glance, it's an issue with requests and not with reppy.

commented

Yes, I'm well aware that requests uses urllib3; I was just pointing out that this bug lies somewhere outside of reppy.

I'm going to close this out as "Won't Fix" because we're not going to implement workarounds for servers sending bad data back. If the content is not gzipped but the headers report that it is, that's an error on the server's part. Doing arbitrary content detection is notoriously tricky and potentially a security issue.
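
For anyone who needs these sites to work anyway, one possible client-side workaround (a sketch, assuming the body is actually plain text mislabeled as gzip) is to fetch the bytes yourself with content decoding disabled and hand them to Robots.parse:

import requests
from reppy.robots import Robots

url = 'http://stackoverflow.com/robots.txt'
# stream=True defers body handling; reading the raw urllib3 response
# with decode_content=False skips the gzip decode that raises here
response = requests.get(url, stream=True)
raw = response.raw.read(decode_content=False)
# Robots.parse takes the robots.txt URL and its content directly
robots = Robots.parse(url, raw.decode('utf-8', errors='replace'))
print(robots.allowed('http://stackoverflow.com/questions', 'my-agent'))

Alternatively, sending Accept-Encoding: identity so the server never claims gzip may avoid the problem entirely.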