seomoz / reppy

Modern robots.txt Parser for Python

ValueError : Need User Agent

azotlikid opened this issue · comments

commented

Hello,

With some robots.txt files I get this exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "reppy/robots.pyx", line 78, in reppy.robots.FetchMethod (reppy/robots.cpp:3235)
  File "reppy/robots.pyx", line 89, in reppy.robots.FetchMethod (reppy/robots.cpp:2962)
  File "reppy/robots.pyx", line 71, in reppy.robots.ParseMethod (reppy/robots.cpp:2375)
  File "reppy/robots.pyx", line 129, in reppy.robots.Robots.__init__ (reppy/robots.cpp:3947)
ValueError: Need User-Agent

Example:

from reppy.robots import Robots
r = Robots.fetch('https://tools.pingdom.com/robots.txt')
r = Robots.fetch('http://www.lidd.fr/robots.txt')
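
For reference, the error seems reproducible without a network call, assuming Robots.parse(url, content) goes through the same parser as fetch() (a hypothetical sketch; the live files above may change over time):

from reppy.robots import Robots

# Content with a rule but no recognizable User-agent line, which is the
# situation in the robots.txt files above.
content = "Disallow: /private\n"

try:
    Robots.parse('http://example.com/robots.txt', content)
except ValueError as exc:
    print(exc)  # expected to print something like "Need User-Agent"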

This is almost certainly a duplicate of #47.

commented

You're right:

https://tools.pingdom.com/robots.txt does not have a User-agent line
http://www.lidd.fr/robots.txt has a typo: "User-Agent" instead of "User-agent"

What is the best policy? Crawl the site as if there were no robots.txt, or skip the website?

For http://www.lidd.fr/robots.txt, I believe the User-Agent directive should be case-insensitive; I'll see if that's a bug in reppy.

Regarding your "best policy" question: in some sense, if the robots.txt is malformed, it's hard to know what to do programmatically. We discussed some options in #47 already. One solution would be to treat the directives as if they applied to the wildcard user agent; that seems like the sanest fallback, since it is probably the intent of the robots.txt. Alternatively, a stricter interpretation would be to disallow everything.
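
As a caller-side sketch of that wildcard fallback (not reppy's built-in behaviour): download the raw file yourself, and if parsing raises the ValueError, prepend a wildcard User-agent line and parse again. The use of urlopen and Robots.parse here is an assumption about how you fetch and re-parse, not a prescribed API.

from urllib.request import urlopen

from reppy.robots import Robots

def fetch_with_wildcard_fallback(url):
    # Download the raw robots.txt ourselves so we can retry parsing
    # with a modified body if the original is malformed.
    body = urlopen(url).read().decode('utf-8', errors='replace')
    try:
        return Robots.parse(url, body)
    except ValueError:
        # Treat the orphaned directives as applying to every agent,
        # which is probably what the site intended.
        return Robots.parse(url, 'User-agent: *\n' + body)

robots = fetch_with_wildcard_fallback('https://tools.pingdom.com/robots.txt')
print(robots.allowed('https://tools.pingdom.com/', 'my-crawler'))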

To clarify, this is from the original REP spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.

So User-Agent should be interpreted in a case-insensitive manner, as should all of the other field names.
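
As a sketch of what that case-insensitive handling amounts to (a hypothetical helper, not reppy's actual parser code):

def parse_directive(line):
    # Split "Field: value" and lower-case the field name, so "User-Agent",
    # "user-agent" and "USER-AGENT" are all treated the same.
    field, _, value = line.partition(':')
    return field.strip().lower(), value.strip()

print(parse_directive('User-Agent: *'))  # ('user-agent', '*')
print(parse_directive('user-agent: *'))  # ('user-agent', '*')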

When I download http://www.lidd.fr/robots.txt, it appears to have a Unicode BOM at its start, which causes the initial User-Agent: * line to be ignored because it doesn't match a known directive. Google appears to ignore a BOM, and this seems to be a common problem, as reported here. I'm going to say that we should probably strip any possible BOM from the beginning of the file, but this will need to be done in rep-cpp, so that is a separate issue.
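
Until that lands in rep-cpp, a caller can strip the BOM from the raw bytes before handing them to the parser; a minimal sketch for the UTF-8 case (pairing it with Robots.parse, as above, is an assumption):

import codecs

def strip_utf8_bom(raw):
    # Drop a leading UTF-8 byte order mark so the first "User-agent:" line
    # matches a known directive instead of being ignored.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

raw = b'\xef\xbb\xbfUser-agent: *\nDisallow: /private\n'
print(strip_utf8_bom(raw).decode('utf-8'))  # BOM removed, directives intact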

I'm going to close out this issue since the non-BOM URL falls into #47.

When seomoz/rep-cpp#14 gets merged, I'll create a PR to update the Git submodule in this project, and that will eventually land in a new release of reppy.

v0.4.5 has been released on PyPI, and it addresses the UTF-8 BOM (byte order mark) issue.