seomoz / reppy

Modern robots.txt Parser for Python

https issue with robots.fetch()

DannyCork opened this issue

I have tried this on Mac OS X and Ubuntu and am encountering the same issue.

This is my code:

import reppy
from reppy.robots import Robots

robots = Robots.fetch('http://daft.ie/robots.txt')
print(robots)

When I execute the above, nothing happens; the script just hangs.
When I interrupt the execution, this is the traceback:

$ python3 test.py 
^CTraceback (most recent call last):
  File "test.py", line 5, in <module>
    robots = Robots.fetch('http://daft.ie/robots.txt')
  File "reppy/robots.pyx", line 100, in reppy.robots.FetchMethod
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/sessions.py", line 665, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/sessions.py", line 665, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/sessions.py", line 245, in resolve_redirects
    **adapter_kwargs
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/xxx/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/xxx/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/home/xxx/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/xxx/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1356, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
  File "/home/xxx/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 313, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/OpenSSL/SSL.py", line 1716, in recv_into
    result = _lib.SSL_read(self._ssl, buf, nbytes)
KeyboardInterrupt

It's hanging on the SSL connection to the host :(

When I try an HTTP-only site, e.g. neverssl.com, execution completes successfully.

$ python3 test.py
{"*": [Directive(Disallow: /)]}

Any ideas? Thanks.

It appears there is something funky with the domain I was trying. I tried a bunch of other domains without issue.
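
For anyone who hits the same hang: the traceback ends in a blocking read inside http/client.py's getresponse(), which is what happens when a host accepts the connection but never sends a response and the client has no timeout. Below is a minimal sketch of that situation using only the standard library, not reppy itself. The helper names start_silent_server and fetch_status are made up for illustration, and the silent local listener is only a stand-in for whatever the problem domain was doing.

```python
import http.client
import socket


def start_silent_server():
    """Open a localhost listener that never answers (stand-in for a stalled host)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(5)               # connections are queued, but nothing ever replies
    return srv, srv.getsockname()[1]


def fetch_status(host, port, path, timeout):
    """Return the HTTP status code, or None if the server stays silent past `timeout`."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        # Without a timeout this call would block forever, as in the traceback.
        return conn.getresponse().status
    except socket.timeout:
        return None
    finally:
        conn.close()


if __name__ == "__main__":
    srv, port = start_silent_server()
    print(fetch_status("127.0.0.1", port, "/robots.txt", timeout=1.0))
    srv.close()
```

The traceback shows Robots.fetch going through requests.get. If it forwards extra keyword arguments to that call (an assumption worth checking against the reppy source for your version), passing something like timeout=10 to Robots.fetch should turn the silent hang into an exception you can catch.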