Incorrect result for mentioned URLs
rock321987 opened this issue
Hi again,
Please find below a few of the URLs where an incorrect result (False) is returned:
https://thesparkgroup.com/available-positions/
http://planate.net/careers/
http://www.halagroup.com/career.php?id=13
Most of the cases I found were:
- WordPress sites
- sites returning 404, 403, or some other error for the robots.txt page
- the last URL listed above, which seems to fail for no reason that I can see
From what I understand, all of these URLs should return True. Correct me if I am wrong.
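For context, here is a minimal sketch of the kind of check I am running (the `'my-agent'` user agent string is just a placeholder):

```python
from reppy.robots import Robots

# Fetch the site's robots.txt and check whether the page is allowed.
robots = Robots.fetch('http://www.halagroup.com/robots.txt')
print(robots.allowed('http://www.halagroup.com/career.php?id=13', 'my-agent'))
```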
Many sites return 403 for the default user agent of `python-requests/<version>` (tested using `curl -i --header 'User-agent: python-requests/2.10.0' http://www.halagroup.com/robots.txt`), including the sites below (a `requests` version of that check is sketched after the list):
- https://thesparkgroup.com/robots.txt
- http://www.halagroup.com/robots.txt
- http://careers.saharapcc.com/robots.txt
- http://www.ubm.com/robots.txt
- https://www.etaxjobs.com/robots.txt
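To illustrate, the same check can be reproduced with `requests` directly (a sketch; the exact user agent strings are only examples):

```python
import requests

url = 'http://www.halagroup.com/robots.txt'

# The default python-requests user agent: many of these sites answer 403.
resp = requests.get(url, headers={'User-Agent': 'python-requests/2.10.0'})
print(resp.status_code)

# A browser-like user agent is usually not blocked.
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(resp.status_code)
```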
Per the original robots.txt RFC:

> On server response indicating access restrictions (HTTP Status Code 401 or 403) a robot should regard access to the site completely restricted.
For the example from planate.net, it was successful for me when I tried it on the current master:

```python
from reppy.robots import Robots

robots = Robots.fetch('http://planate.net/robots.txt')
# This returned True for me, as expected
robots.allowed('http://planate.net/careers/', 'foo-agent')
```
Are you using any custom headers, like providing your own user agent?
That's weird. I am getting False in Python 2.7, 3.5, and 3.6 for the code you mentioned above. I have installed it both via pip and from source.
Here is the snapshot
Is there any way to know the status code returned while fetching the robots.txt URL? I treat sites giving these errors as allowing everything.
@rock321987: It could be due to IP-based blacklisting or dynamic banning by the site in question. Presumably, @dlecocq is using a different IP than you.
Can you confirm what the response is, using either `curl` from the command line or `pdb` and extracting the response via `reppy`?
Also, it's considered polite to temporarily self-block on 5XX status codes for robots.txt, similar to the 401/403 case, because it's possible that the site is having issues due to crawling. For example, Google will not crawl your site if you return a 500 or 503 status code for robots.txt.
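If you are fetching robots.txt yourself, that policy might look roughly like the sketch below (the function name and return values are made up purely for illustration; this is not reppy's built-in behavior):

```python
import requests

def robots_policy(robots_url, user_agent='my-crawler/1.0'):
    """Apply the 401/403 and 5XX conventions described above."""
    resp = requests.get(robots_url, headers={'User-Agent': user_agent})
    if resp.status_code in (401, 403):
        return 'disallow_all'   # access restricted: stay off the site entirely
    if resp.status_code >= 500:
        return 'disallow_all'   # server trouble: back off temporarily
    if resp.status_code >= 400:
        return 'allow_all'      # no robots.txt: everything is allowed
    return resp.text            # otherwise parse the returned rules normally
```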
You could also add some logging to an `after_response_hook` to get the status code for the response.
@b4hand IP ban is not a problem; I am using a proxy anyway. Using wget I get a 403, which is expected. I will try it on another machine (though it shouldn't matter, just for the sake of convincing myself) and get back to you.
How can I add that logging? Do I need to modify reppy's source code?
It's from `requests`. `reppy` will forward any `*args` and `**kwargs` provided to `Robots.fetch` directly to `requests.get`, which allows you to do:

```python
from reppy.robots import Robots

def after_response_hook(response):
    print('Raw HTTP response: %s' % response)

Robots.fetch(..., after_response_hook=after_response_hook)
```
The `after_response_hook` keyword argument allows you to have a callback once the raw response has been received.
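For example (a sketch; the logging setup and the assumption that the callback receives the underlying `requests` response are mine), the hook can pull out the status code asked about above:

```python
import logging
from reppy.robots import Robots

logging.basicConfig(level=logging.INFO)

def after_response_hook(response):
    # Assuming this is the raw requests.Response, log its status code and URL.
    logging.info('robots.txt fetch returned %s for %s',
                 response.status_code, response.url)

robots = Robots.fetch('http://www.halagroup.com/robots.txt',
                      after_response_hook=after_response_hook)
```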
I know this is closed, but I'm curious because I don't seem to see a way to do this - is it possible to provide a user-agent that reppy will use when fetching robots.txt? I'm getting 403s, and it's resolved by providing a user agent. With a library like robotexclusionrulesparser, you can set the user agent. Not sure if there's a similar option for reppy?
The `Robots.fetch` method will let you. It accepts `**kwargs` which are passed to `requests`, so you can say `Robots.fetch(..., headers={'user-agent': 'my-awesome-user-agent'})`. When using the cache, you can include the same `**kwargs` at cache creation time.
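A sketch of the cache usage under that assumption (the `RobotsCache` constructor arguments shown here are illustrative and may differ between reppy versions):

```python
from reppy.cache import RobotsCache

# kwargs supplied at construction time are forwarded to requests on every fetch.
cache = RobotsCache(capacity=100, headers={'user-agent': 'my-awesome-user-agent'})
print(cache.allowed('http://www.halagroup.com/career.php?id=13',
                    'my-awesome-user-agent'))
```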
Ah, I had looked at the source but missed the use of `self.args` in the cache, thanks!