spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Binary files are parsed, ignoring setParseableMimeTypes

jespejoh opened this issue

When both attributes are set, the crawler sometimes fails to ignore binary files (e.g. ZIP archives, videos). I've pinned the error down to the __invoke() method of CrawlRequestFulfilled:

    // At this point $body is an empty string, so the script continues but fails at a later stage.
    $body = $this->getBody($response);

    $robots = new CrawlerRobots(
        $response->getHeaders(),
        $body,
        $this->crawler->mustRespectRobots()
    );

    $crawlUrl = $this->crawler->getCrawlQueue()->getUrlById($index);

    if ($this->crawler->mayExecuteJavaScript()) {
        $body = $this->getBodyAfterExecutingJavaScript($crawlUrl->url);

        $response = $response->withBody(stream_for($body));
    }

I've managed to solve this issue locally by adding the following at the beginning of the function, so that crawling of this URL stops if the MIME type is not supported:

    // Stop early when the response's MIME type is not in the parseable list.
    $contentType = $response->getHeaderLine('Content-Type');

    if (! $this->isMimetypeAllowedToParse($contentType)) {
        return '';
    }

I'm not sure whether it has any unexpected consequences, but so far it works as expected. What do you think?

Happy to open a PR if you prefer it that way.
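
For readers following along, the guard proposed above can be sketched as a standalone script. This is only an illustration under stated assumptions, not the library's implementation: `isMimeTypeAllowed()` and `bodyToParse()` are hypothetical helpers, and treating an empty allow-list as "parse everything" is an assumption made for this example.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the crawler's MIME-type check (not the library's code).
 * Returns true when the Content-Type header matches an entry in the allow-list.
 * Assumption: an empty allow-list means every type may be parsed.
 */
function isMimeTypeAllowed(string $contentType, array $parseableMimeTypes): bool
{
    if ($parseableMimeTypes === []) {
        return true;
    }

    // Drop parameters such as "; charset=utf-8" and normalise case.
    $mimeType = strtolower(trim(explode(';', $contentType)[0]));

    foreach ($parseableMimeTypes as $allowed) {
        if ($mimeType === strtolower(trim($allowed))) {
            return true;
        }
    }

    return false;
}

/**
 * Hypothetical guard: hands back the body only for allowed types,
 * and an empty string otherwise, so binary payloads are never parsed.
 */
function bodyToParse(string $contentType, string $body, array $parseableMimeTypes): string
{
    return isMimeTypeAllowed($contentType, $parseableMimeTypes) ? $body : '';
}

$allowed = ['text/html', 'text/plain'];

var_dump(isMimeTypeAllowed('text/html; charset=UTF-8', $allowed)); // bool(true)
var_dump(isMimeTypeAllowed('application/zip', $allowed));          // bool(false)
```

In this sketch a missing Content-Type header (an empty string) is rejected whenever an allow-list is set; whether the real crawler should behave that way is a separate design question.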

Dear contributor,

because this issue has been inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues, feel free to reopen it.

I have the same error.