seomoz / reppy

Modern robots.txt Parser for Python

ValueError : Need User Agent

azotlikid opened this issue · comments

commented

Hello,

With some robots.txt files I get this exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "reppy/robots.pyx", line 78, in reppy.robots.FetchMethod (reppy/robots.cpp:3235)
  File "reppy/robots.pyx", line 89, in reppy.robots.FetchMethod (reppy/robots.cpp:2962)
  File "reppy/robots.pyx", line 71, in reppy.robots.ParseMethod (reppy/robots.cpp:2375)
  File "reppy/robots.pyx", line 129, in reppy.robots.Robots.__init__ (reppy/robots.cpp:3947)
ValueError: Need User-Agent

Example:

from reppy.robots import Robots
r = Robots.fetch('https://tools.pingdom.com/robots.txt')
r = Robots.fetch('http://www.lidd.fr/robots.txt')
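
For reference, the error seems reproducible without a network call, assuming Robots.parse(url, content) goes through the same parser as fetch() (a hypothetical sketch; the live files above may change over time):

from reppy.robots import Robots

# Content with a rule but no recognizable User-agent line, which is the
# situation in the robots.txt files above.
content = "Disallow: /private\n"

try:
    Robots.parse('http://example.com/robots.txt', content)
except ValueError as exc:
    print(exc)  # expected to print something like "Need User-Agent"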

This is almost certainly a duplicate of #47.

commented

You're right:

https://tools.pingdom.com/robots.txt does not have a User-agent line
http://www.lidd.fr/robots.txt has a typo: "User-Agent" instead of "User-agent"

What is the best policy? Crawl the site as if there were no robots.txt, or skip the website?

For http://www.lidd.fr/robots.txt, I believe the User-Agent directive should be case-insensitive; I'll see if that's a bug in reppy.

Regarding your "best policy" question: in some sense, if the robots.txt is malformed, it's hard to know what to do programmatically. We discussed some options in #47 already. One solution would be to treat the directives as if they applied to the wildcard user agent; that seems like the sanest fallback, since it is probably the intent of the robots.txt. Alternatively, a stricter interpretation would be to disallow everything.
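
As a caller-side sketch of that wildcard fallback (not reppy's built-in behaviour): download the raw file yourself, and if parsing raises the ValueError, prepend a wildcard User-agent line and parse again. The use of urlopen and Robots.parse here is an assumption about how you fetch and re-parse, not a prescribed API.

from urllib.request import urlopen

from reppy.robots import Robots

def fetch_with_wildcard_fallback(url):
    # Download the raw robots.txt ourselves so we can retry parsing
    # with a modified body if the original is malformed.
    body = urlopen(url).read().decode('utf-8', errors='replace')
    try:
        return Robots.parse(url, body)
    except ValueError:
        # Treat the orphaned directives as applying to every agent,
        # which is probably what the site intended.
        return Robots.parse(url, 'User-agent: *\n' + body)

robots = fetch_with_wildcard_fallback('https://tools.pingdom.com/robots.txt')
print(robots.allowed('https://tools.pingdom.com/', 'my-crawler'))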

To clarify, this is from the original REP spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

So User-Agent should be interpreted in a case-insensitive manner, as should all of the other field names.
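
As a sketch of what that case-insensitive handling amounts to (a hypothetical helper, not reppy's actual parser code):

def parse_directive(line):
    # Split "Field: value" and lower-case the field name, so "User-Agent",
    # "user-agent" and "USER-AGENT" are all treated the same.
    field, _, value = line.partition(':')
    return field.strip().lower(), value.strip()

print(parse_directive('User-Agent: *'))  # ('user-agent', '*')
print(parse_directive('user-agent: *'))  # ('user-agent', '*')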

When I download http://www.lidd.fr/robots.txt, it appears to have a Unicode BOM at its start, which causes the initial User-Agent: * line to be ignored because it doesn't match a known directive. Google appears to ignore a BOM, and this seems to be a common problem, as reported here. I'm going to say that we should probably strip any possible BOM from the beginning of the file, but this will need to be done in rep-cpp, so that is a separate issue.
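
Until that lands in rep-cpp, a caller can strip the BOM from the raw bytes before handing them to the parser; a minimal sketch for the UTF-8 case (pairing it with Robots.parse, as above, is an assumption):

import codecs

def strip_utf8_bom(raw):
    # Drop a leading UTF-8 byte order mark so the first "User-agent:" line
    # matches a known directive instead of being ignored.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

raw = b'\xef\xbb\xbfUser-agent: *\nDisallow: /private\n'
print(strip_utf8_bom(raw).decode('utf-8'))  # BOM removed, directives intact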

I'm going to close out this issue since the non-BOM URL falls into #47.

When seomoz/rep-cpp#14 gets merged, I'll create a PR to update the Git submodule in this project, and that will eventually land in a new release of reppy.

v0.4.5 has been released on PyPI, and it addresses the UTF-8 BOM (byte order mark) issue.