VIPnytt / RobotsTxtParser

An extensible robots.txt parser and client library, with full support for every directive and specification.

Issue with rules containing both wildcards * and an end anchor $

webarchitect609 opened this issue

Bug Report

SUMMARY

The end-of-URL anchor $ is not supported as described in Google's robots.txt specification:
https://developers.google.com/search/reference/robots_txt
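Per that specification, * matches any sequence of characters and $ anchors the pattern to the end of the URL, so Disallow: /*.jpg$ should block every path ending in .jpg (e.g. /image.jpg or /foo/bar/image.jpg) while still allowing something like /image.jpg?size=large.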

STEPS TO REPRODUCE

Save the following code in a file and run it:

<?php

require 'vendor/autoload.php';

// A rule combining the wildcard (*) with the end anchor ($)
$robotsTxtContent = <<<END
User-agent: *
Disallow: /*.jpg$

END;

$txtClient = new \vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// All three URLs end in ".jpg", so each should be disallowed
var_dump($txtClient->userAgent('*')->isDisallowed('/image.jpg'));
var_dump($txtClient->userAgent('*')->isDisallowed('http://example.com/image.jpg'));
var_dump($txtClient->userAgent('*')->isDisallowed('http://example.com/foo/bar/image.jpg'));

EXPECTED RESULTS

bool(true) dumped three times.

ACTUAL RESULTS

bool(false) dumped three times:

/usr/bin/php test.php
bool(false)
bool(false)
bool(false)

Process finished with exit code 0

The issue is isolated to rules containing both * (wildcard) and $ (end anchor). Rules containing only one of these are unaffected (a variety of other tests pass).
The bug is here: /src/Client/Directives/DirectiveClientTrait.php#L113

It seems that no other robots.txt parser has a solution for this either (none that I'm aware of, at least). Any help is appreciated!
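For anyone hitting the same problem, one common approach is to compile each rule into a regular expression, turning * into .* and a trailing $ into an end anchor. A minimal sketch follows; the robotsPatternToRegex helper is hypothetical and not part of this library's API:

<?php

// Hypothetical helper: compile a robots.txt path pattern into a PCRE regex.
// Per Google's spec, '*' matches any sequence of characters and a trailing
// '$' anchors the match to the end of the URL.
function robotsPatternToRegex(string $pattern): string
{
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Escape regex metacharacters, then turn the escaped '*' back into '.*'
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return '#^' . $regex . ($anchored ? '$' : '') . '#';
}

var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/image.jpg'));          // bool(true)
var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/foo/bar/image.jpg'));  // bool(true)
var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/image.jpg?size=l'));   // bool(false)

Escaping the pattern first with preg_quote keeps other regex metacharacters, such as the . in .jpg, from being interpreted; only the wildcard and the trailing anchor carry special meaning in robots.txt.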

Fixed in version 2.0.1
Thank you for the bug report!