seomoz / reppy

Modern robots.txt Parser for Python

Incorrect result for mentioned URLs

rock321987 opened this issue

Hi again,

Please find attached a few of the URLs where an incorrect result (False here) is returned:

https://thesparkgroup.com/available-positions/
http://planate.net/careers/
http://www.halagroup.com/career.php?id=13

Most of the cases that I found were:

  • WordPress sites
  • sites returning 404, 403, or other errors for the robots.txt page
  • the last URL mentioned in the file, which seems to fail for no reason that I can see

From what I understand, all of these URLs should return True. Correct me if I am wrong.

reppyFail.xlsx
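
For context, the check I'm running is essentially this (a minimal sketch of my reproduction; 'my-agent' is just a placeholder user agent):

from reppy.robots import Robots

# (robots.txt URL, page URL) pairs for the sites above
checks = [
    ('https://thesparkgroup.com/robots.txt', 'https://thesparkgroup.com/available-positions/'),
    ('http://planate.net/robots.txt', 'http://planate.net/careers/'),
    ('http://www.halagroup.com/robots.txt', 'http://www.halagroup.com/career.php?id=13'),
]

for robots_url, page_url in checks:
    robots = Robots.fetch(robots_url)
    print(page_url, robots.allowed(page_url, 'my-agent'))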

Many sites return 403 for the default user agent of python-requests/<version>, including the halagroup.com one above (tested using curl -i --header 'User-agent: python-requests/2.10.0' http://www.halagroup.com/robots.txt).
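
The same check from Python rather than curl (a quick sketch using requests directly; the second user-agent string is just an example):

import requests

# Compare the status code the site returns for different user agents
for agent in ('python-requests/2.10.0', 'my-crawler/1.0'):
    resp = requests.get('http://www.halagroup.com/robots.txt',
                        headers={'User-Agent': agent})
    print(agent, resp.status_code)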

Per the original robots.txt RFC,

On server response indicating access restrictions (HTTP Status
Code 401 or 403) a robot should regard access to the site
completely restricted.

For the example from planate.net, it was successful for me when I tried it on the current master:

from reppy.robots import Robots
robots = Robots.fetch('http://planate.net/robots.txt')
# This returned True for me, as expected
robots.allowed('http://planate.net/careers/', 'foo-agent')

Are you using any custom headers, like providing your own user agent?

That's weird. I am getting False on Python 2.7, 3.5, and 3.6 for the code you mentioned above. I have installed it both via pip and from source.

Here is a snapshot (screenshot: selection_052).

Is there any way to know the status code returned while fetching the robots.txt URL? I treat sites returning these errors as allowing everything.

@rock321987: It could be due to IP-based blacklisting or dynamic banning by the site in question. Presumably, @dlecocq is using a different IP than you.

Can you confirm what the response is, using either curl from the command line or pdb to extract the response from reppy?

Also, it's considered polite to temporarily self-block on 5XX status codes for robots.txt, similar to the 401/403 case, because it's possible that the site is having issues caused by crawling. For example, Google will not crawl your site if you return a 500 or 503 status code for robots.txt.
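
For illustration, a status-code policy along those lines might look like this (a rough sketch using requests directly, not reppy's built-in behavior; the function name and return values are made up):

import requests

def robots_policy(robots_url, agent='my-crawler/1.0'):
    # 401/403 -> treat the whole site as disallowed (per the RFC quoted above)
    # 5XX     -> back off and retry later, since the site may be struggling
    # 404     -> no robots.txt, so everything is allowed
    resp = requests.get(robots_url, headers={'User-Agent': agent})
    if resp.status_code in (401, 403):
        return 'disallow-all'
    if resp.status_code >= 500:
        return 'retry-later'
    if resp.status_code == 404:
        return 'allow-all'
    return 'parse'  # fetch succeeded; parse the body normally

print(robots_policy('http://www.halagroup.com/robots.txt'))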

You could also add some logging to an after_response_hook to get the status code for the response.

@b4hand An IP ban is not the problem; I am using a proxy anyway. Using wget I get a 403, which is expected. I will try it on another machine (though it shouldn't matter, just for the sake of convincing myself) and get back to you.

How can I add that logging? Do I need to modify reppy's source code?

It's from requests. reppy will forward on any *args and **kwargs provided to Robots.fetch directly to requests.get, which allows you to do:

def after_response_hook(response):
    print('Raw HTTP response: %s' % response)

Robots.fetch(..., after_response_hook=after_response_hook)

The after_response_hook keyword argument allows you to have a callback once the raw response has been received.
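
For the status code specifically, something like this should work (a sketch; it assumes the hook is handed the raw requests response object described above):

from reppy.robots import Robots

def log_status(response):
    # If the argument is a requests.Response it will have a status_code;
    # use getattr to stay safe if something else is passed in.
    print('robots.txt fetch status: %s' % getattr(response, 'status_code', None))

robots = Robots.fetch('http://www.halagroup.com/robots.txt',
                      after_response_hook=log_status)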

I know this is closed, but I'm curious because I don't see a way to do this: is it possible to provide a user agent that reppy will use when fetching robots.txt? I'm getting 403s, and providing a user agent resolves them. With a library like robotexclusionrulesparser you can set the user agent; is there a similar option for reppy?

The Robots.fetch method will let you. It accepts **kwargs which are passed to requests, so you can say Robots.fetch(..., headers={'user-agent': 'my-awesome-user-agent'}). When using the cache, you can include the same **kwargs at cache creation time.
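
For example (a sketch; the RobotsCache class name, its capacity argument, and its allowed method are assumptions about reppy's cache module, and the user-agent string is just a placeholder):

from reppy.robots import Robots
from reppy.cache import RobotsCache

headers = {'User-Agent': 'my-awesome-user-agent'}

# One-off fetch with a custom user agent
robots = Robots.fetch('http://www.halagroup.com/robots.txt', headers=headers)
print(robots.allowed('http://www.halagroup.com/career.php?id=13', 'my-awesome-user-agent'))

# The same kwargs can be supplied when creating the cache
cache = RobotsCache(capacity=100, headers=headers)
print(cache.allowed('http://www.halagroup.com/career.php?id=13', 'my-awesome-user-agent'))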

Ah, I had looked at the source but missed the use of self.args in the cache, thanks!