seomoz / reppy

Modern robots.txt Parser for Python

RobotsCache: Exceptions cannot be handled (not being re-thrown)

dosisod opened this issue

When checking whether a page is allowed and an exception is thrown, a traceback is printed to the screen but the exception is not re-raised. The allowed() function simply returns False, making it hard to differentiate between disallowed sites and offline sites.

from reppy.cache import RobotsCache

rob = RobotsCache(capacity=100)

# localhost:1234 is not bound, so this is meant to error.
# A ConnectionException is raised; it is printed but not re-raised.
out = rob.allowed("http://localhost:1234", "bot")

print(out)  # prints False

This is similar to #110 in that there is only a True or False output, but here an exception is not being raised at all.

I would open a pull request, but I am having trouble understanding where and how to test the code I would need to change. My suggestion is an optional rethrow=False parameter here:

def allowed(self, url, agent):

This keeps backwards compatibility while adding the ability to re-raise instead of returning False (if desired).
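Something like this hypothetical sketch (the method body here is just my guess at the current structure, not the actual implementation):

def allowed(self, url, agent, rethrow=False):
    # Hypothetical: optionally re-raise fetch errors instead of swallowing them.
    try:
        return self.get(url).allowed(url, agent)
    except Exception:
        if rethrow:
            raise
        return False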

Great question! And it's one I had to think back a bit to remember how we manage this. I'll take you on a quick tour, but there's an easy change to accomplish what it sounds like you want.

So, the Robots class, which does all the fetching, parsing, etc., lets users pass through to fetch any arguments they'd pass to requests, so a lot of behavior can be controlled there. The Cache classes have even more bells and whistles, primarily driven by policies in reppy.cache.policy. At the moment there are two implemented: 1) one that returns a default object (for RobotsCache the default cache policy returns a Robots object that returns False for every URL if an exception is raised: https://github.com/seomoz/reppy/blob/master/reppy/cache/__init__.py#L79), and 2) one that re-raises the exception. The first is useful in situations where we just want to answer the question "are we for sure allowed to fetch this URL?", where anything other than successfully sussing that out means 'no.'
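For example, a quick illustration of that pass-through (timeout is just a standard requests keyword and is forwarded as-is):

from reppy.robots import Robots

# Keyword arguments to fetch() are forwarded to requests, so standard
# options like timeout apply:
robots = Robots.fetch('http://example.com/robots.txt', timeout=5)
print(robots.allowed('http://example.com/some/path/', 'my-user-agent'))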

But if you have code that would like to deal with the nuance of different types of errors, you absolutely can! You could extend CachePolicyBase to implement whatever behavior you'd like if the built-in ones don't suit your needs (there's a rough sketch of that after the example below). But if you are just looking to have exceptions raised:

from reppy.cache import RobotsCache
from reppy.cache.policy import ReraiseExceptionPolicy

rob = RobotsCache(capacity=100, cache_policy=ReraiseExceptionPolicy(60))

# localhost:1234 is not bound, so this is meant to error.
# The ConnectionException is actually raised now.
out = rob.allowed("http://localhost:1234", "bot")

print(out)  # never reached: the exception propagates before out is assigned
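With that in place, calling code can separate "disallowed" from "unreachable" by catching the exception, along these lines (I'm importing ConnectionException from reppy.exceptions here, which I believe is where it lives; double-check the path):

from reppy.cache import RobotsCache
from reppy.cache.policy import ReraiseExceptionPolicy
from reppy.exceptions import ConnectionException

rob = RobotsCache(capacity=100, cache_policy=ReraiseExceptionPolicy(60))

try:
    if rob.allowed("http://localhost:1234", "bot"):
        pass  # allowed: safe to go fetch the page
except ConnectionException:
    pass  # site unreachable, now distinct from "disallowed"

And if neither built-in policy suits you, a custom one can be quite small. Here is a rough sketch, assuming the same single exception-handling hook the built-in policies follow (double-check the method name and signature against reppy/cache/policy.py):

import logging

from reppy.cache.policy import CachePolicyBase


class LoggingReraisePolicy(CachePolicyBase):
    # Hypothetical policy: log the failing URL, then re-raise for the caller.
    def exception(self, url, exception):
        logging.warning('robots.txt fetch failed for %s: %s', url, exception)
        raise exception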

Thank you, this solved my problem. The solution was a lot easier than I anticipated. Thanks for the quick response!