t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing

Regex meta char escape using preg_quote

ranvis opened this issue · comments

When a pattern is built as preg_match('@...@'), the input should be escaped with preg_quote($rule, '@').
Currently, one of the following warnings occurs when a path contains a regex metacharacter:

PHP Warning: preg_match(): Compilation failed: missing ) at offset 15 in /path/to/vendor/t1gor/robots-txt-parser/source/robotstxtparser.php on line 836
PHP Warning: preg_match(): Compilation failed: unmatched parentheses at offset 1 in /path/to/vendor/t1gor/robots-txt-parser/source/robotstxtparser.php on line 836
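For anyone who wants to see the failure in isolation, here is a minimal sketch that does not use the library at all. The variable names are made up for illustration; only preg_match() and preg_quote() are real. It assumes the rule is dropped into an '@'-delimited pattern, as described above.

```php
<?php
// A path taken from a "Disallow: /(" line.
$rule = '/(';

// Unescaped: '(' opens a group that is never closed, so the pattern fails
// to compile. preg_match() emits a warning (suppressed here) and returns false.
$unescaped = @preg_match('@' . $rule . '@', '/(');

// Escaped: preg_quote() turns '(' into '\(' so it matches literally.
// Passing '@' as the second argument also escapes the delimiter itself.
$escaped = preg_match('@' . preg_quote($rule, '@') . '@', '/(');

var_dump($unescaped); // bool(false)
var_dump($escaped);   // int(1)
var_dump(preg_quote('a@b', '@')); // string(4) "a\@b"
```

Note that preg_quote() without the second argument would leave '@' alone, since '@' is not a regex metacharacter on its own.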

I've seen it in some rare cases, but unfortunately never had the time to investigate it... This is indeed a bug.

Regex is not my expertise, but could this be as simple as using a character that is not valid in a URL as the delimiter instead of "@"?
All of the "@"s should already be escaped as far as I can see, but I'm clearly wrong about that... It's not my code, and I don't fully understand it either, to be honest...

Running rawurlencode() on paths, as the code currently does, is a good approach I think, since a URL may contain any character code.
But that doesn't make regex escaping unnecessary, because it is only URL escaping.
I only took a glance at the code, so I may be wrong about this.
Anyway, sorry for being lazy and not adding a failing case earlier. Tested on e1b052c.
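To illustrate the difference between the two kinds of escaping, here is a small standalone sketch. It does not call the library; the variable names are only for illustration, and the '^' anchor is an assumption about how a rule would be matched against the start of a path.

```php
<?php
// A path taken from a "Disallow: /." line.
$rule = '/.';

// rawurlencode() leaves '.' untouched because it is an unreserved URL character.
$encodedDot = rawurlencode('.');

// Without regex escaping, '.' matches any character, so '/x' matches too.
$loose = preg_match('@^' . $rule . '@', '/x');

// With preg_quote(), '.' only matches a literal dot.
$strict  = preg_match('@^' . preg_quote($rule, '@') . '@', '/x');
$literal = preg_match('@^' . preg_quote($rule, '@') . '@', '/.');

var_dump($encodedDot); // string(1) "."
var_dump($loose);      // int(1)
var_dump($strict);     // int(0)
var_dump($literal);    // int(1)
```

So even when no warning is raised, an unescaped rule can silently match paths it should not.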

require_once(__DIR__ . '/vendor/autoload.php');
$parser = new \RobotsTxtParser('User-agent: webcrawler
Disallow: /(
Disallow: /)
Disallow: /.
');
var_dump($parser->isAllowed('/%5C.', 'webcrawler') == true); // bool(false), expected bool(true)
var_dump($parser->isAllowed('/(', 'webcrawler') == false); // bool(false), expected bool(true)

I just took a look at the issue again, unable to fix it (for now), but here is something to continue on for the next person who tries to fix it...

    private function checkBasicRule($rule, $path)
    {
        $rule = $this->encode_url($rule);
        $rule = preg_quote($rule);
        // match result
        if (preg_match('@' . $rule . '@', $path)) {
            if (mb_stripos($rule, '$') !== false) {
                if (mb_strlen($rule) - 1 == mb_strlen($path)) {
                    return true;
                }
            } else {
                $this->log[] = "Rule match: Path";
                return true;
            }
        }
        return false;
    }

I'm not sure what the problem is, but I think this template is a good place to start...
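One thing that may be worth checking in the snippet above: preg_quote() also escapes '*' and '$', so after quoting they no longer act as the robots.txt wildcard and end-of-path marker, and the call does not pass '@' as the delimiter. Below is a rough sketch of one way around that, under my own assumptions. ruleToRegex() and matchesRule() are hypothetical helpers, not part of the library, and URL encoding is deliberately left out to keep the example short.

```php
<?php
// Hypothetical helper (not in the library): converts a robots.txt path rule
// into an '@'-delimited PCRE pattern anchored at the start of the path.
function ruleToRegex($rule)
{
    // A trailing '$' in a robots.txt rule anchors it to the end of the path.
    $anchored = substr($rule, -1) === '$';
    if ($anchored) {
        $rule = substr($rule, 0, -1);
    }

    // Quote each literal piece separately, then rejoin the pieces with '.*'
    // so that '*' keeps its wildcard meaning.
    $parts = array_map(function ($part) {
        return preg_quote($part, '@');
    }, explode('*', $rule));

    return '@^' . implode('.*', $parts) . ($anchored ? '$' : '') . '@';
}

// Hypothetical helper: true when $path is covered by $rule.
function matchesRule($rule, $path)
{
    return preg_match(ruleToRegex($rule), $path) === 1;
}

var_dump(matchesRule('/(', '/('));        // bool(true)
var_dump(matchesRule('/)', '/)foo'));     // bool(true)
var_dump(matchesRule('/.', '/x'));        // bool(false)
var_dump(matchesRule('/a*b$', '/axxb'));  // bool(true)
var_dump(matchesRule('/a*b$', '/axxbc')); // bool(false)
```

Splitting on '*' before quoting means the metacharacters inside each piece are escaped while the wildcard itself is not. Whether this fits into checkBasicRule() as it stands is something I have not verified.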