AntoineGagne / robots

A parser for robots.txt with support for wildcards. See also RFC 9309.

Empty rules should be ignored

1player opened this issue

I'm dealing with a server which has the following robots.txt file:

# START YOAST BLOCK
# ---------------------------
User-agent: *
Disallow:
# ---------------------------
# END YOAST BLOCK

Given that file, this library treats everything as disallowed:

iex(1)> {:ok, rules} = :robots.parse("User-agent: *\nDisallow:\n\n", 200)
{:ok, %{"*" => {[], [""]}}}
iex(2)> :robots.is_allowed("example/1.0", "/", rules)
false

This is incorrect. Google's documentation says its crawlers completely ignore empty rules: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#disallow

And Yoast, the SEO service that generated this robots.txt file, explicitly shows that snippet as an example of an "allow all" rule: https://yoast.com/ultimate-guide-robots-txt/#syntax (see the section titled "The disallow directive")
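
For reference, per the Google documentation linked above, the following two files are both read as "allow everything" (the second is just the explicit spelling of the first):

User-agent: *
Disallow:

User-agent: *
Allow: /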

The simplest way to fix this would be to do what Google does: if an Allow or Disallow rule has no path specified, ignore the rule completely.
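
Something along these lines would do it (a rough Elixir sketch only, not the library's actual Erlang internals; the module name and the {directive, path} rule shape are made up for illustration):

# Hypothetical sketch: drop Allow/Disallow directives whose path is empty,
# mirroring Google's documented behaviour.
defmodule EmptyRules do
  # `rules` is assumed to be a list of {:allow | :disallow, path} tuples.
  def drop_empty(rules) do
    Enum.reject(rules, fn {_directive, path} -> String.trim(path) == "" end)
  end
end

# Example:
# EmptyRules.drop_empty([disallow: "", allow: "/public"])
# #=> [allow: "/public"]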

Sorry for the delay, and thanks for submitting the issue. This should be fixed by #18. I will publish a release with the fix soon.

Thanks, you have saved me from spending a weekend learning Erlang to push a fix myself :) Much appreciated!