seomoz / reppy

Modern robots.txt Parser for Python

www.bestbuy.com/robots.txt false negative

enewhuis opened this issue · comments

commented

For some reason this URL

/site/Global/Free-Shipping/pcmcat276800050002.c?id=pcmcat276800050002

is disallowed by the library against this robots.txt from www.bestbuy.com:

User-agent: *
Disallow: /*id=pcmcat140800050004
Disallow: /*id=pcmcat143800050032
Disallow: /nex/
Disallow: /shop/
Disallow: /*~~*
Disallow: /*jsessionid=
Disallow: /*dnmId=*
Disallow: /*ld=*lg=*rd=*
Disallow: /m/e/*
Disallow: /site/builder/*
Disallow: /site/promo/black-friday-*
Disallow: /site/promo/Black-Friday-*
Disallow: /*template=_gameDetailsTab
Disallow: /*template=_movieDetailsTab
Disallow: /*template=_musicDetailsTab
Disallow: /*template=_softwareDetailsTab
Disallow: /*template=_accessoriesTab
Disallow: /*template=_castAndCrewTab
Disallow: /*template=_editorialTab
Disallow: /*template=_episodesTab
Disallow: /*template=_protectionAndServicesTab
Disallow: /*template=_specificationsTab
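
None of these rules should match the path in question. As a quick sanity check outside of reppy, treating * in a rule as matching any sequence of characters (the usual wildcard interpretation; the disallowed helper below is only an illustration of that semantics, not reppy's actual matching code), plain Python agrees:

import re

path = '/site/Global/Free-Shipping/pcmcat276800050002.c?id=pcmcat276800050002'
rules = [
    '/*id=pcmcat140800050004',
    '/*id=pcmcat143800050032',
    '/nex/',
    '/shop/',
    '/*~~*',
    '/*jsessionid=',
    '/*dnmId=*',
    '/*ld=*lg=*rd=*',
    '/m/e/*',
    '/site/builder/*',
    '/site/promo/black-friday-*',
    '/site/promo/Black-Friday-*',
    '/*template=_gameDetailsTab',
    # ... the remaining /*template= rules are omitted; the path contains no 'template='
]

def disallowed(rule, path):
    # Translate the rule into an anchored regex: '*' matches any sequence
    # of characters, everything else is literal. (No rule here uses '$'.)
    pattern = '^' + '.*'.join(re.escape(piece) for piece in rule.split('*'))
    return re.search(pattern, path) is not None

print(any(disallowed(rule, path) for rule in rules))
# False -- so the URL should be allowed
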
commented

Some of the asterisks were eaten by my pasting.

I was able to reproduce this when testing against www.bestbuy.com, but could not reproduce it in an isolated test. Digging into it a little more, it appears that Best Buy returns a 403 status code depending on the user agent provided. Witness:

curl --user-agent 'python-requests/1.2.0' -vv 'http://www.bestbuy.com/robots.txt'
* Hostname was NOT found in DNS cache
*   Trying 184.84.183.104...
* Connected to www.bestbuy.com (184.84.183.104) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: python-requests/1.2.0
> Host: www.bestbuy.com
> Accept: */*
> 
< HTTP/1.1 403 Forbidden
* Server AkamaiGHost is not blacklisted
< Server: AkamaiGHost
< Mime-Version: 1.0
< Content-Type: text/html
< Content-Length: 278
< Expires: Mon, 15 Dec 2014 16:06:17 GMT
< Date: Mon, 15 Dec 2014 16:06:17 GMT
< Connection: close
< 
<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;www&#46;bestbuy&#46;com&#47;robots&#46;txt" on this server.<P>
Reference&#32;&#35;18&#46;64b754b8&#46;1418659577&#46;f69ee5c
</BODY>
</HTML>
* Closing connection 0

If you use the curl user agent, however, you get back what you've no doubt been seeing in your browser:

curl -vv 'http://www.bestbuy.com/robots.txt'
* Hostname was NOT found in DNS cache
*   Trying 184.84.183.104...
* Connected to www.bestbuy.com (184.84.183.104) port 80 (#0)
> GET /robots.txt HTTP/1.1
> User-Agent: curl/7.35.0
> Host: www.bestbuy.com
> Accept: */*
> 
< HTTP/1.1 200 OK
* Server Apache is not blacklisted
< Server: Apache
< ETag: "4808de921619e8b172adcce026c1d263:1415638626"
< Last-Modified: Mon, 10 Nov 2014 16:57:06 GMT
< Content-Type: text/plain
< Date: Mon, 15 Dec 2014 16:04:15 GMT
< Content-Length: 2146
< Connection: keep-alive
< 
User-agent: *
Disallow: /*id=pcmcat140800050004
Disallow: /*id=pcmcat143800050032
Disallow: /nex/
Disallow: /shop/
Disallow: /*~~*
Disallow: /*jsessionid=
Disallow: /*dnmId=*
Disallow: /*ld=*lg=*rd=*
Disallow: /m/e/*
Disallow: /site/builder/*
Disallow: /site/promo/black-friday-*
Disallow: /site/promo/Black-Friday-*
Disallow: /*template=_gameDetailsTab
Disallow: /*template=_movieDetailsTab
Disallow: /*template=_musicDetailsTab
Disallow: /*template=_softwareDetailsTab
Disallow: /*template=_accessoriesTab
Disallow: /*template=_castAndCrewTab
Disallow: /*template=_editorialTab
Disallow: /*template=_episodesTab
Disallow: /*template=_protectionAndServicesTab
Disallow: /*template=_specificationsTab

Sitemap: http://www.bestbuy.com/sitemap_p_index.xml
Sitemap: http://www.bestbuy.com/sitemap_p_1.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_2.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_3.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_4.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_5.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_6.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_7.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_8.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_9.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_10.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_11.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_12.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_13.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_14.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_15.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_16.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_17.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_18.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_19.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_20.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_21.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_22.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_23.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_24.xml.gz
Sitemap: http://www.bestbuy.com/sitemap_p_25.xml.gz
Sitemap: http://www.* Connection #0 to host www.bestbuy.com left intact

Or, from Python:

import requests
session = requests.Session()
session.headers.update({'User-Agent': 'not-a-bot'})
response = session.get('http://www.bestbuy.com/robots.txt')
response.status_code
# 200

session = requests.Session()
response = session.get('http://www.bestbuy.com/robots.txt')
response.status_code
# 403

In order to fix this, you can supply your own requests.Session object:

import requests
from reppy.cache import RobotsCache

# Use a session whose User-Agent isn't blocked by the server.
session = requests.Session()
session.headers.update({'User-Agent': 'not-a-bot'})
cache = RobotsCache(session=session)
cache.allowed('http://www.bestbuy.com/site/Global/Free-Shipping/pcmcat276800050002.c?id=pcmcat276800050002', 'not-a-bot')
# True

For what it's worth, a lot of websites block the "default" user agent of most general-purpose programming languages like Python, Perl, Java, etc. The assumption is that a bot written so carelessly that its user agent was never changed probably isn't following other netiquette guidelines either.

Closing this out, as it doesn't correspond to a bug in reppy, but rather to a server doing user-agent-based cloaking.

I believe we need detailed usage documentation with examples for reppy. The documentation does say, though:
Customizing fetch
The fetch method accepts *args and **kwargs that are passed on to requests.get, allowing you to customize the way the fetch is executed:

robots = Robots.fetch('http://example.com/robots.txt', headers={...})

This hints that we can send "User-Agent" header info to fetch, which passes it on to requests.get.
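
For example, something like this should presumably work (a rough sketch: the Robots import path is my assumption, and 'not-a-bot' is just the agent string used earlier in this thread):

from reppy.robots import Robots  # assumed import path for the Robots class

# Supply a non-default User-Agent so the robots.txt request itself
# isn't answered with the 403 "Access Denied" page shown above.
robots = Robots.fetch('http://www.bestbuy.com/robots.txt',
                      headers={'User-Agent': 'not-a-bot'})
robots.allowed('http://www.bestbuy.com/site/Global/Free-Shipping/pcmcat276800050002.c?id=pcmcat276800050002', 'not-a-bot')
# True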

That's true if you're using fetch directly, but this issue was discussing RobotsCache, which calls fetch indirectly with the arguments passed to the cache's constructor.

Caching
There are two cache classes provided -- RobotsCache, which caches entire reppy.Robots objects, and AgentCache, which only caches the reppy.Agent relevant to a client. These caches duck-type the class that they cache for the purposes of checking if a URL is allowed:

from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')

from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')

Like reppy.Robots.fetch, the cache constructor accepts a ttl_policy to inform the expiration of the fetched Robots objects, as well as *args and **kwargs to be passed to reppy.Robots.fetch.
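
Going by that description, something like the following would presumably let the cache fetch robots.txt with a custom User-Agent (a sketch, not tested; capacity and the agent string are just the values used earlier in this thread):

from reppy.cache import RobotsCache

# Keyword arguments beyond the cache's own (capacity, ttl_policy, ...) are
# described as being forwarded to Robots.fetch and from there to requests.get,
# so a custom User-Agent header should apply to every robots.txt fetch.
cache = RobotsCache(capacity=100, headers={'User-Agent': 'not-a-bot'})
cache.allowed('http://www.bestbuy.com/site/Global/Free-Shipping/pcmcat276800050002.c?id=pcmcat276800050002', 'not-a-bot')
# True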

The documentation here doesn't explicitly provide an example showing that we can pass a requests session object, though.

I would prefer that you present this information as a pull request rather than as comments on this closed issue. If you open a PR with doc changes, it is very likely to be merged.