AntoineGagne / robots

A parser for robots.txt with support for wildcards. See also RFC 9309.

Empty rules should be ignored

1player opened this issue

I'm dealing with a server which has the following robots.txt file:

# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:
# ---------------------------
# END YOAST BLOCK

Given that file, this library treats everything as disallowed:

iex(1)> {:ok, rules} = :robots.parse("User-agent: *\nDisallow:\n\n", 200)
{:ok, %{"*" => {[], [""]}}}
iex(2)> :robots.is_allowed("example/1.0", "/", rules)
false

This is incorrect. Google's documentation says its crawlers completely ignore empty rules: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#disallow

And Yoast, the SEO service that generated this robots.txt file, explicitly shows that snippet as an example of an "allow all" rule: https://yoast.com/ultimate-guide-robots-txt/#syntax (see the section titled "The disallow directive")
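
For reference, per the Google documentation linked above, the following two files are both read as "allow everything" (the second is just the explicit spelling of the first):

User-agent: *
Disallow:

User-agent: *
Allow: /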

The simplest way to fix this would be to do what Google does: if an Allow or Disallow rule has no path specified, ignore the rule completely.
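
Something along these lines would do it (a rough Elixir sketch only, not the library's actual Erlang internals; the module name and the {directive, path} rule shape are made up for illustration):

# Hypothetical sketch: drop Allow/Disallow directives whose path is empty,
# mirroring Google's documented behaviour.
defmodule EmptyRules do
  # `rules` is assumed to be a list of {:allow | :disallow, path} tuples.
  def drop_empty(rules) do
    Enum.reject(rules, fn {_directive, path} -> String.trim(path) == "" end)
  end
end

# Example:
# EmptyRules.drop_empty([disallow: "", allow: "/public"])
# #=> [allow: "/public"]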

Sorry for the delay, and thanks for submitting the issue. This should be fixed by #18. I will publish a release with the fix soon.

Thanks, you have saved me from spending a weekend learning Erlang to push a fix myself :) Much appreciated!