t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing


Allow/Disallow rules not handled correctly

ogolovanov opened this issue

From https://yandex.com/support/webmaster/controlling-robot/robots-txt.xml?lang=ru#simultaneous

The Allow and Disallow directives from the corresponding User-agent block are sorted according to URL prefix length (from shortest to longest) and applied in order. If several directives match a particular site page, the robot selects the last one in the sorted list. This way the order of directives in the robots.txt file doesn't affect how they are used by the robot.

Source robots.txt:

User-agent: Yandex
Allow: /
Allow: /catalog/auto
Disallow: /catalog

Sorted robots.txt:

User-agent: Yandex
Allow: /
Disallow: /catalog
Allow: /catalog/auto
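
To make the quoted Yandex algorithm concrete, here is a minimal sketch (not this parser's code) of that selection logic: sort the Allow/Disallow rules of the matching User-agent block by prefix length and let the last matching rule win. The helper yandexDecision and the rule array layout are made up for illustration.

<?php
// Hypothetical helper illustrating the Yandex algorithm quoted above:
// sort rules by prefix length (shortest to longest), apply them in order,
// and keep the decision of the last rule that matches the path.
function yandexDecision(array $rules, string $path): bool
{
    // $rules: list of ['type' => 'allow'|'disallow', 'path' => '/prefix']
    usort($rules, function ($a, $b) {
        return strlen($a['path']) <=> strlen($b['path']);
    });

    $allowed = true; // nothing matches => allowed by default
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) === 0) {
            $allowed = ($rule['type'] === 'allow');
        }
    }
    return $allowed;
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(yandexDecision($rules, '/catalog/'));      // false => disallowed
var_dump(yandexDecision($rules, '/catalog/auto1')); // true  => allowed

For plain prefix rules this is equivalent to picking the longest matching rule, which is why the sorted robots.txt above ends up disallowing /catalog/.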

<?php
// Minimal reproduction (assumes the class is autoloaded, e.g. via Composer).
$c = <<<ROBOTS
User-agent: *
Allow: /
Allow: /catalog/auto
Disallow: /catalog
ROBOTS;

$r = new RobotsTxtParser($c);
$url = 'http://test.ru/catalog/';

// "Disallow: /catalog" is the longest matching rule for this URL,
// so the call below should report it as disallowed.
var_dump($r->isDisallowed($url));

Result: false
Expected result: true

For Google this is different:

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt?hl=en#google-supported-non-group-member-records
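
For comparison, a sketch of the Google behaviour quoted above: scan all matching rules and keep the most specific one (longest path). Again, googleDecision is a hypothetical helper for illustration, not the library's API.

<?php
// Hypothetical sketch of Google's precedence as quoted above: among all
// rules whose path prefix matches, the longest (most specific) one wins.
function googleDecision(array $rules, string $path): bool
{
    $best = null;
    foreach ($rules as $rule) {
        if (strpos($path, $rule['path']) !== 0) {
            continue; // rule does not match this path
        }
        if ($best === null || strlen($rule['path']) > strlen($best['path'])) {
            $best = $rule;
        }
    }
    // No matching rule => allowed by default.
    return $best === null || $best['type'] === 'allow';
}

$rules = [
    ['type' => 'allow',    'path' => '/'],
    ['type' => 'allow',    'path' => '/catalog/auto'],
    ['type' => 'disallow', 'path' => '/catalog'],
];

var_dump(googleDecision($rules, '/catalog/')); // false => disallowed

For prefix-only rules like the ones in this issue, both approaches give the same answer (/catalog/ is disallowed); per Google's note, they may only diverge when wildcards are involved, where the order of precedence is undefined.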