VIPnytt / RobotsTxtParser

An extensible robots.txt parser and client library, with full support for every directive and specification.

Issue with rules containing both wildcards * and an end anchor $

webarchitect609 opened this issue

Bug Report

SUMMARY

The end-of-URL anchor $ is not supported as described in Google's robots.txt specification:
https://developers.google.com/search/reference/robots_txt
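Per that specification, * matches any sequence of characters and $ anchors the pattern to the end of the URL, so Disallow: /*.jpg$ should block every path ending in .jpg (e.g. /image.jpg or /foo/bar/image.jpg) while still allowing something like /image.jpg?size=large.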

STEPS TO REPRODUCE

Save the following code in a file and run it:

<?php

require 'vendor/autoload.php';

// A rule combining the wildcard (*) with the end anchor ($)
$robotsTxtContent = <<<END
User-agent: *
Disallow: /*.jpg$

END;

$txtClient = new \vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// All three URLs end in ".jpg", so each should be disallowed
var_dump($txtClient->userAgent('*')->isDisallowed('/image.jpg'));
var_dump($txtClient->userAgent('*')->isDisallowed('http://example.com/image.jpg'));
var_dump($txtClient->userAgent('*')->isDisallowed('http://example.com/foo/bar/image.jpg'));

EXPECTED RESULTS

bool(true) dumped three times.

ACTUAL RESULTS

bool(false) dumped three times:

/usr/bin/php test.php
bool(false)
bool(false)
bool(false)

Process finished with exit code 0

The issue is isolated to rules containing both * (wildcard) and $ (end anchor). Rules containing only one of these are unaffected (a variety of other tests pass).
The bug is here: /src/Client/Directives/DirectiveClientTrait.php#L113

It seems that no other robots.txt parser has a solution for this either (none that I'm aware of, at least). Any help is appreciated!
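For anyone hitting the same problem, one common approach is to compile each rule into a regular expression, turning * into .* and a trailing $ into an end anchor. A minimal sketch follows; the robotsPatternToRegex helper is hypothetical and not part of this library's API:

<?php

// Hypothetical helper: compile a robots.txt path pattern into a PCRE regex.
// Per Google's spec, '*' matches any sequence of characters and a trailing
// '$' anchors the match to the end of the URL.
function robotsPatternToRegex(string $pattern): string
{
    $anchored = substr($pattern, -1) === '$';
    if ($anchored) {
        $pattern = substr($pattern, 0, -1);
    }
    // Escape regex metacharacters, then turn the escaped '*' back into '.*'
    $regex = str_replace('\*', '.*', preg_quote($pattern, '#'));
    return '#^' . $regex . ($anchored ? '$' : '') . '#';
}

var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/image.jpg'));          // bool(true)
var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/foo/bar/image.jpg'));  // bool(true)
var_dump((bool) preg_match(robotsPatternToRegex('/*.jpg$'), '/image.jpg?size=l'));   // bool(false)

Escaping the pattern first with preg_quote keeps other regex metacharacters, such as the . in .jpg, from being interpreted; only the wildcard and the trailing anchor carry special meaning in robots.txt.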

Fixed in version 2.0.1
Thank you for the bug report!