seomoz / reppy

Modern robots.txt Parser for Python

Incorrect result for mentioned URLs

rock321987 opened this issue

Hi again,

Please find attached a few of the URLs where an incorrect result (False here) is returned:

https://thesparkgroup.com/available-positions/
http://planate.net/careers/
http://www.halagroup.com/career.php?id=13

Most of the cases that I found were:

  • WordPress sites
  • sites returning 404, 403, or other errors for the robots.txt page
  • the last URL mentioned in the file, which seems to fail for no reason that I can see

From what I understand, all of these URLs should return True. Correct me if I am wrong.

reppyFail.xlsx
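
For context, the check I'm running is essentially this (a minimal sketch of my reproduction; 'my-agent' is just a placeholder user agent):

from reppy.robots import Robots

# (robots.txt URL, page URL) pairs for the sites above
checks = [
    ('https://thesparkgroup.com/robots.txt', 'https://thesparkgroup.com/available-positions/'),
    ('http://planate.net/robots.txt', 'http://planate.net/careers/'),
    ('http://www.halagroup.com/robots.txt', 'http://www.halagroup.com/career.php?id=13'),
]

for robots_url, page_url in checks:
    robots = Robots.fetch(robots_url)
    print(page_url, robots.allowed(page_url, 'my-agent'))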

Many sites return 403 for the default user agent of python-requests/<version>, including the halagroup.com one above (tested using curl -i --header 'User-agent: python-requests/2.10.0' http://www.halagroup.com/robots.txt).
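
The same check from Python rather than curl (a quick sketch using requests directly; the second user-agent string is just an example):

import requests

# Compare the status code the site returns for different user agents
for agent in ('python-requests/2.10.0', 'my-crawler/1.0'):
    resp = requests.get('http://www.halagroup.com/robots.txt',
                        headers={'User-Agent': agent})
    print(agent, resp.status_code)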

Per the original robots.txt RFC,

On server response indicating access restrictions (HTTP Status
Code 401 or 403) a robot should regard access to the site
completely restricted.

For the example from planate.net, it was successful for me when I tried it on the current master:

from reppy.robots import Robots
robots = Robots.fetch('http://planate.net/robots.txt')
# This returned True for me, as expected
robots.allowed('http://planate.net/careers/', 'foo-agent')

Are you using any custom headers, like providing your own user agent?

That's weird. I am getting False on Python 2.7, 3.5, and 3.6 for the code you mentioned above. I have installed it both via pip and from source.

Here is a snapshot (screenshot: selection_052).

Is there any way to know the status code returned while fetching the robots.txt URL? I treat sites returning these errors as allowing everything.

@rock321987: It could be due to IP-based blacklisting or dynamic banning by the site in question. Presumably, @dlecocq is using a different IP than you.

Can you confirm what the response is, using either curl from the command line or pdb to extract the response from reppy?

Also, it's considered polite to temporarily self-block on 5XX status codes for robots.txt, similar to the 401/403 case, because it's possible that the site is having issues caused by crawling. For example, Google will not crawl your site if you return a 500 or 503 status code for robots.txt.
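
For illustration, a status-code policy along those lines might look like this (a rough sketch using requests directly, not reppy's built-in behavior; the function name and return values are made up):

import requests

def robots_policy(robots_url, agent='my-crawler/1.0'):
    # 401/403 -> treat the whole site as disallowed (per the RFC quoted above)
    # 5XX     -> back off and retry later, since the site may be struggling
    # 404     -> no robots.txt, so everything is allowed
    resp = requests.get(robots_url, headers={'User-Agent': agent})
    if resp.status_code in (401, 403):
        return 'disallow-all'
    if resp.status_code >= 500:
        return 'retry-later'
    if resp.status_code == 404:
        return 'allow-all'
    return 'parse'  # fetch succeeded; parse the body normally

print(robots_policy('http://www.halagroup.com/robots.txt'))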

You could also add some logging to an after_response_hook to get the status code for the response.

@b4hand An IP ban is not the problem; I am using a proxy anyway. Using wget I get a 403, which is expected. I will try it on another machine (though it shouldn't matter, just for the sake of convincing myself) and get back to you.

How can I add that logging? Do I need to modify reppy's source code?

It's from requests. reppy will forward on any *args and **kwargs provided to Robots.fetch directly to requests.get, which allows you to do:

def after_response_hook(response):
    print('Raw HTTP response: %s' % response)

Robots.fetch(..., after_response_hook=after_response_hook)

The after_response_hook keyword argument allows you to have a callback once the raw response has been received.
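
For the status code specifically, something like this should work (a sketch; it assumes the hook is handed the raw requests response object described above):

from reppy.robots import Robots

def log_status(response):
    # If the argument is a requests.Response it will have a status_code;
    # use getattr to stay safe if something else is passed in.
    print('robots.txt fetch status: %s' % getattr(response, 'status_code', None))

robots = Robots.fetch('http://www.halagroup.com/robots.txt',
                      after_response_hook=log_status)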

I know this is closed, but I'm curious because I don't see a way to do this: is it possible to provide a user agent that reppy will use when fetching robots.txt? I'm getting 403s, and providing a user agent resolves them. With a library like robotexclusionrulesparser you can set the user agent; is there a similar option for reppy?

The Robots.fetch method will let you. It accepts **kwargs which are passed to requests, so you can say Robots.fetch(..., headers={'user-agent': 'my-awesome-user-agent'}). When using the cache, you can include the same **kwargs at cache creation time.
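
For example (a sketch; the RobotsCache class name, its capacity argument, and its allowed method are assumptions about reppy's cache module, and the user-agent string is just a placeholder):

from reppy.robots import Robots
from reppy.cache import RobotsCache

headers = {'User-Agent': 'my-awesome-user-agent'}

# One-off fetch with a custom user agent
robots = Robots.fetch('http://www.halagroup.com/robots.txt', headers=headers)
print(robots.allowed('http://www.halagroup.com/career.php?id=13', 'my-awesome-user-agent'))

# The same kwargs can be supplied when creating the cache
cache = RobotsCache(capacity=100, headers=headers)
print(cache.allowed('http://www.halagroup.com/career.php?id=13', 'my-awesome-user-agent'))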

Ah, I had looked at the source but missed the use of self.args in the cache, thanks!