t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing


Allow/Disallow rules not handled correctly

ogolovanov opened this issue

From https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml?lang=ru#simultaneous

The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot.

Source robots.txt:

User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog

Sorted robots.txt:

User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
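
To make the quoted Yandex algorithm concrete, here is a minimal sketch (not this parser's code) of that selection logic: sort the Allow/Disallow rules of the matching User-agent block by prefix length and let the last matching rule win. The helper yandexDecision and the rule array layout are made up for illustration.

<?php
// Hypothetical helper illustrating the Yandex algorithm quoted above:
// sort rules by prefix length (shortest to longest), apply them in order,
// and keep the decision of the last rule that matches the path.
function yandexDecision(array $rules, string $path): bool
{
    // $rules: list of ['type' => 'allow'|'disallow', 'path' => '/prefix']
    usort($rules, function ($a, $b) {
        return strlen($a['path']) <=> strlen($b['path']);
    });

    $allowed = true; // nothing matches => allowed by default
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0) {
            $allowed = ($rule['type'] === 'allow');
        }
    }
    return $allowed;
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(yandexDecision($rules, '/catalog/'));      // false => disallowed
var_dump(yandexDecision($rules, '/catalog/auto1')); // true  => allowed

For plain prefix rules this is equivalent to picking the longest matching rule, which is why the sorted robots.txt above ends up disallowing /catalog/.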

<?php
// Minimal reproduction (assumes the class is autoloaded, e.g. via Composer).
$c = <<<ROBOTS
User-agent: *
Allow: /
Allow: /catalog/auto
Disallow: /catalog
ROBOTS;

$r = new RobotsTxtParser($c);
$url = 'http://test.ru/catalog/';

// "Disallow: /catalog" is the longest matching rule for this URL,
// so the call below should report it as disallowed.
var_dump($r->isDisallowed($url));

Result: false
Expected result: true

For Google this is different:

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#google-supported-non-group-member-records
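
For comparison, a sketch of the Google behaviour quoted above: scan all matching rules and keep the most specific one (longest path). Again, googleDecision is a hypothetical helper for illustration, not the library's API.

<?php
// Hypothetical sketch of Google's precedence as quoted above: among all
// rules whose path prefix matches, the longest (most specific) one wins.
function googleDecision(array $rules, string $path): bool
{
    $best = null;
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) !== 0) {
            continue; // rule does not match this path
        }
        if ($best === null || strlen($rule['path']) > strlen($best['path'])) {
            $best = $rule;
        }
    }
    // No matching rule => allowed by default.
    return $best === null || $best['type'] === 'allow';
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(googleDecision($rules, '/catalog/')); // false => disallowed

For prefix-only rules like the ones in this issue, both approaches give the same answer (/catalog/ is disallowed); per Google's note, they may only diverge when wildcards are involved, where the order of precedence is undefined.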