Binary files are parsed, ignoring setParseableMimeTypes
jespejoh opened this issue · comments
When setting both attributes, the crawler sometimes fails to ignore binary files (e.g. ZIP, video). I've pinned the error down to the __invoke() method of CrawlRequestFulfilled:
$body = $this->getBody($response); // At this point $body is an empty string, so the script continues, but it fails at a later stage.
$robots = new CrawlerRobots(
    $response->getHeaders(),
    $body,
    $this->crawler->mustRespectRobots()
);
$crawlUrl = $this->crawler->getCrawlQueue()->getUrlById($index);
if ($this->crawler->mayExecuteJavaScript()) {
    $body = $this->getBodyAfterExecutingJavaScript($crawlUrl->url);
    $response = $response->withBody(stream_for($body));
}
I've managed to solve this issue locally by just adding this at the beginning of the function, so the crawling for this URL stops if the mime type is not supported:
$contentType = $response->getHeaderLine('Content-Type');
if (! $this->isMimetypeAllowedToParse($contentType)) {
    return '';
}
Not sure if it has any unexpected consequences but so far it works as expected. What do you think?
Happy to open a PR if you prefer it that way.
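For reference, here is a minimal stand-alone sketch of the kind of Content-Type check the guard above relies on. It is not the library's implementation of isMimetypeAllowedToParse: the function name isMimeTypeParseable, the $parseableMimeTypes parameter (standing in for whatever was passed to setParseableMimeTypes), and the "empty list means no restriction" rule are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, NOT the crawler's own code.
// $parseableMimeTypes stands in for the list given to setParseableMimeTypes().
function isMimeTypeParseable(string $contentTypeHeader, array $parseableMimeTypes): bool
{
    // Assumption: an empty allow-list means "parse everything".
    if ($parseableMimeTypes === []) {
        return true;
    }

    // Drop parameters such as "; charset=UTF-8" and normalise case,
    // so "Text/HTML; charset=UTF-8" compares equal to "text/html".
    $mimeType = strtolower(trim(explode(';', $contentTypeHeader)[0]));

    foreach ($parseableMimeTypes as $allowed) {
        if ($mimeType === strtolower(trim($allowed))) {
            return true;
        }
    }

    return false;
}

// Usage: an HTML response passes, a ZIP download does not.
var_dump(isMimeTypeParseable('text/html; charset=UTF-8', ['text/html']));
var_dump(isMimeTypeParseable('application/zip', ['text/html']));
```

Running a check like this before any further processing of the response means a binary download is skipped as a whole, instead of continuing with an empty body.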
Dear contributor,
because this issue seems to be inactive for quite some time now, I've automatically closed it. If you feel this issue deserves some attention from my human colleagues feel free to reopen it.
I have the same error.