Allow/Disallow rules not handled correctly
ogolovanov opened this issue
From https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml?lang=ru#simultaneous
The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot.
Source robots.txt:
User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog

Sorted robots.txt:
User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
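
For reference, a minimal sketch of the selection procedure quoted above. This is not the library's code: the function name and the $rules layout are invented for illustration, only plain path prefixes are handled (no wildcards or $ anchors), and the URL is assumed to be reduced to its path component.

// Sort Allow/Disallow rules by prefix length (shortest to longest),
// then let the last matching rule in the sorted list decide.
function yandexStyleIsDisallowed(array $rules, string $path): bool
{
    usort($rules, function (array $a, array $b) {
        return strlen($a['path']) <=> strlen($b['path']);
    });

    $disallowed = false; // no matching rule => allowed
    foreach ($rules as $rule) {
        // A rule matches if its path is a prefix of the requested path.
        if (strpos($path, $rule['path']) === 0) {
            // Later (longer) matches override earlier (shorter) ones.
            $disallowed = ($rule['type'] === 'disallow');
        }
    }
    return $disallowed;
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(yandexStyleIsDisallowed($rules, '/catalog/'));     // bool(true)
var_dump(yandexStyleIsDisallowed($rules, '/catalog/auto')); // bool(false)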
Reproduction with this library:
$c = <<<ROBOTS
User-agent: *
Allow: /
Allow: /catalog/auto
Disallow: /catalog
ROBOTS;
$r = new RobotsTxtParser($c);
$url = 'http://test.ru/catalog/';
var_dump($r->isDisallowed($url));
Result: false
Expected result: true
For Google this is different:
At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.
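
Note that for the URL in question the two interpretations agree: under longest-match precedence, Disallow: /catalog is still the most specific match for /catalog/. A minimal sketch under the same assumptions as above (hypothetical names, plain prefixes only, wildcards and the allow-wins-on-tie detail ignored):

// Google-style precedence sketch: the longest matching prefix wins.
function googleStyleIsDisallowed(array $rules, string $path): bool
{
    $bestLen = -1;
    $disallowed = false; // no matching rule => allowed
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0 && strlen($rule['path']) > $bestLen) {
            $bestLen = strlen($rule['path']);
            $disallowed = ($rule['type'] === 'disallow');
        }
    }
    return $disallowed;
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(googleStyleIsDisallowed($rules, '/catalog/')); // bool(true)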