t1gor / Robots.txt-Parser-Class

PHP class for robots.txt parsing

Very poor performance of isAllowed()/isDisallowed() functions

ogolovanov opened this issue:

I took an average robots.txt file (50-60 allow/disallow rules) and 10,000 URLs for the test.
Running the isAllowed() method for each URL took 37 seconds of CPU time in total.
Sorry, but this is very slow.
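
For reference, the measurement loop was roughly the following (a minimal sketch; the file names and the constructor call are assumptions, since the original test data is not shared in this issue):

    <?php
    require 'vendor/autoload.php'; // assuming Composer autoloading

    // placeholders for the original test inputs
    $parser = new RobotsTxtParser(file_get_contents('robots.txt'));
    $urls   = file('urls.txt', FILE_IGNORE_NEW_LINES); // ~10,000 test URLs

    $start = microtime(true);
    foreach ($urls as $url) {
        $parser->isAllowed($url); // one check per URL, as in the test above
    }
    // wall-clock time; the 37 s figure above is CPU time
    printf("Total: %.2f seconds\n", microtime(true) - $start);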

For example, look here:
50% (!!!) of all CPU time is spent on the $this->isInlineDirective($rule) call.
Why can't we make that call once during parser initialization and save 50% of the CPU time? (A sketch of what that could look like follows the snippet below.)


    protected function checkRuleSwitch($rule, $path)
    {
        switch ($this->isInlineDirective($rule)) {
            case self::DIRECTIVE_CLEAN_PARAM:
                if ($this->checkCleanParamRule($this->stripInlineDirective($rule), $path)) {
                    return true;
                }
                break;
            case self::DIRECTIVE_HOST:
                if ($this->checkHostRule($this->stripInlineDirective($rule))) {
                    return true;
                }
                break;
            default:
                if ($this->checkBasicRule($rule, $path)) {
                    return true;
                }
        }
        return false;
    }
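
A minimal sketch of the suggested precomputation, assuming a hypothetical $directiveCache array that is filled once while the rules are parsed ($directiveCache and addRule() are illustrative names, not the library's actual API):

    protected $directiveCache = array();

    protected function addRule($rule)
    {
        // isInlineDirective() depends only on the rule text, so its result
        // can be memoized once, when the rule is first stored
        $this->directiveCache[$rule] = $this->isInlineDirective($rule);
    }

    protected function checkRuleSwitch($rule, $path)
    {
        // an O(1) array lookup replaces the repeated string inspection
        switch ($this->directiveCache[$rule]) {
            case self::DIRECTIVE_CLEAN_PARAM:
                return $this->checkCleanParamRule($this->stripInlineDirective($rule), $path);
            case self::DIRECTIVE_HOST:
                return $this->checkHostRule($this->stripInlineDirective($rule));
            default:
                return $this->checkBasicRule($rule, $path);
        }
    }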

Or look here.
Why can't we make the call to $this->prepareRegexRule($this->encode_url($rule)) during initialization and save 30% of the CPU time? (Again, a sketch follows the snippet.)

    private function checkBasicRule($rule, $path)
    {
        $rule = $this->prepareRegexRule($this->encode_url($rule));
        // change @ to \@
        $escaped = strtr($rule, array('@' => '\@'));
        // match result
        if (preg_match('@' . $escaped . '@', $path)) {
            if (strpos($escaped, '$') !== false) {
                if (mb_strlen($escaped) - 1 == mb_strlen($path)) {
                    return true;
                }
            } else {
                $this->log[] = 'Rule match: Path';
                return true;
            }
        }
        return false;
    }
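
In the same spirit, a sketch of caching the prepared pattern, with a hypothetical $preparedRules map built at parse time (illustrative names again; the matching logic itself is unchanged):

    private $preparedRules = array();

    private function prepareRule($rule)
    {
        $regex = $this->prepareRegexRule($this->encode_url($rule));
        // escape '@' once, since it is used as the pattern delimiter below
        $this->preparedRules[$rule] = strtr($regex, array('@' => '\@'));
    }

    private function checkBasicRule($rule, $path)
    {
        // the per-call regex preparation is gone; only the match remains
        $escaped = $this->preparedRules[$rule];
        if (preg_match('@' . $escaped . '@', $path)) {
            if (strpos($escaped, '$') !== false) {
                // '$' anchors the rule to the end of the path
                return mb_strlen($escaped) - 1 == mb_strlen($path);
            }
            $this->log[] = 'Rule match: Path';
            return true;
        }
        return false;
    }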

RobotsTxtParser is expected to be initialized once and used many, many times.

$this->prepareRegexRule():
This was done during initialization until a month ago, but not anymore. See #67 (comment):
814737c#diff-a3addb78c633d1984e69a06f8ece166cL325

$this->checkRuleSwitch():
For that one, this repo would have to be rewritten from the ground up...

There is also an initialization/parsing performance issue: #62

A PR is happily accepted, but I must admit this repo has more holes than Swiss cheese, so I'm not sure it's worth fixing. A rewrite is the way to go, at least for the long term...

Btw, if you don't care about the rendering feature, version 0.2.2 should save you 30% of the CPU time (according to your calculations).

https://github.com/t1gor/Robots.txt-Parser-Class/releases/tag/v0.2.2