seomoz / reppy

Modern robots.txt Parser for Python

error while decompressing data: incorrect header check

azotlikid opened this issue

commented

Hello,

I get errors with some URLs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 262, in _decode
    data = self._decoder.decompress(data)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 62, in decompress
    return self._obj.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "reppy/robots.pyx", line 78, in reppy.robots.FetchMethod (reppy/robots.cpp:3235)
  File "reppy/robots.pyx", line 79, in reppy.robots.FetchMethod (reppy/robots.cpp:2706)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 404, in read
    data = self._decode(data, decode_content, flush_decoder)
  File "/usr/local/lib/python3.5/dist-packages/requests/packages/urllib3/response.py", line 267, in _decode
    "failed to decode it." % content_encoding, e)
requests.packages.urllib3.exceptions.DecodeError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check',))
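
For context, this zlib error just means the decompressor was handed bytes that do not begin with a valid gzip/zlib header. You can trigger the same message by feeding plain text to zlib, a minimal illustration:

>>> import zlib
>>> zlib.decompress(b'User-Agent: *')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
zlib.error: Error -3 while decompressing data: incorrect header check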

Steps to reproduce:

from reppy.robots import Robots
robots = Robots.fetch('http://stackoverflow.com/robots.txt')
robots = Robots.fetch('http://askubuntu.com/robots.txt')
robots = Robots.fetch('http://superuser.com/robots.txt')

Fetching with urllib3 alone doesn't fail, though:

>>> import urllib3
>>> http = urllib3.PoolManager()
>>> r = http.request('GET','http://superuser.com/robots.txt')
>>> r.status
200
>>> r.data
b"User-Agent: *\r\nDisallow:     ....

commented

We use requests, not urllib3 directly, and I can reproduce it with just requests:

>>> import requests
>>> r = requests.get('http://stackoverflow.com/robots.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/sessions.py", line 608, in send
    r.content
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 737, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/Users/brandon/.pyenv/versions/2.7.11/lib/python2.7/site-packages/requests/models.py", line 665, in generate
    raise ContentDecodingError(e)
requests.exceptions.ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check',))

So this appears to be an issue with requests, not reppy.

commented

requests uses urllib3 (source), as explained in the README: "[requests] is powered by urllib3, which is embedded within Requests."

At first glance, it's an issue with requests and not with reppy.

commented

Yes, I'm well aware that requests uses urllib3; I was just pointing out that this bug lies somewhere outside of reppy.

I'm going to close this out as "Won't Fix" because we're not going to implement workarounds for servers sending bad data back. If the content is not gzipped but the headers report that it is, that's an error on the server's part. Doing arbitrary content detection is notoriously tricky and potentially a security issue.
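
For anyone who needs these sites to work anyway, one possible client-side workaround (a sketch, assuming the body is actually plain text mislabeled as gzip) is to fetch the bytes yourself with content decoding disabled and hand them to Robots.parse:

import requests
from reppy.robots import Robots

url = 'http://stackoverflow.com/robots.txt'
# stream=True defers body handling; reading the raw urllib3 response
# with decode_content=False skips the gzip decode that raises here
response = requests.get(url, stream=True)
raw = response.raw.read(decode_content=False)
# Robots.parse takes the robots.txt URL and its content directly
robots = Robots.parse(url, raw.decode('utf-8', errors='replace'))
print(robots.allowed('http://stackoverflow.com/questions', 'my-agent'))

Alternatively, sending Accept-Encoding: identity so the server never claims gzip may avoid the problem entirely.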