spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Crawler doesn't push subdomain URLs to the queue if CrawlProfile doesn't extend CrawlSubdomains

kejkej31 opened this issue · comments

Hey,

I created a CustomCrawlProfile that extended the base CrawlProfile class. My CustomCrawlProfile allowed subdomains to be crawled and added some extra filtering, but the crawler was stopping too early: not enough URLs were being pushed to the queue.
I noticed that the CrawlRequestFulfilled class contains this code:

        if (! $this->crawler->getCrawlProfile() instanceof CrawlSubdomains) {
            if ($crawlUrl->url->getHost() !== $this->crawler->getBaseUrl()->getHost()) {
                return;
            }
        }
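To make the effect of that guard concrete, here is a minimal standalone sketch (my own illustration, not library code; the helper name `hostMatchesBase` is hypothetical) of the strict host comparison it performs. Because the comparison is an exact string match, any subdomain host fails it:

```php
<?php

// Hypothetical standalone illustration of the guard above: unless the
// profile extends CrawlSubdomains, the crawler compares hosts with a
// strict equality check, so subdomain hosts never match the base host.
function hostMatchesBase(string $urlHost, string $baseHost): bool
{
    return $urlHost === $baseHost; // exact match only
}

var_dump(hostMatchesBase('example.com', 'example.com'));      // bool(true)
var_dump(hostMatchesBase('blog.example.com', 'example.com')); // bool(false)
```

With a custom profile that merely extends CrawlProfile, every link found on a subdomain page hits the `return` above, which matches the early-stopping behavior I saw.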

So even though the URL gets crawled and its HTML extracted, the URLs found on it are never pushed to the queue.
Is this behavior correct? Shouldn't the CrawlProfile decide whether something gets crawled, instead of a check here for whether the profile extends CrawlSubdomains?
Maybe it was my mistake, but I couldn't find anything in the documentation saying I should extend the CrawlSubdomains class. I assumed it was just a "ready to go" class and that I didn't have to extend it. It took me some time to find out why the crawl was ending early.
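For anyone hitting the same problem: the workaround that the guard implies is to extend CrawlSubdomains (rather than CrawlProfile) and layer the extra filtering on top of its subdomain matching. Below is a minimal standalone sketch of that logic, assuming subdomain matching works as a suffix check on the host; the function names `isSameOrSubdomainOf` and `shouldCrawl` and the block-list parameter are my own, not the library's API:

```php
<?php

// Hypothetical sketch of subdomain-aware filtering, mirroring what a
// profile extending CrawlSubdomains could do in its shouldCrawl() method.

// A host qualifies if it equals the base host or is a true subdomain of it.
// The leading dot prevents "evil-example.com" from matching "example.com".
function isSameOrSubdomainOf(string $host, string $baseHost): bool
{
    return $host === $baseHost
        || str_ends_with($host, '.' . $baseHost);
}

// Extra filtering on top of subdomain matching: here, an explicit block list.
function shouldCrawl(string $host, string $baseHost, array $blockedHosts): bool
{
    return isSameOrSubdomainOf($host, $baseHost)
        && ! in_array($host, $blockedHosts, true);
}

var_dump(shouldCrawl('blog.example.com', 'example.com', []));  // bool(true)
var_dump(shouldCrawl('evil-example.com', 'example.com', [])); // bool(false)
var_dump(shouldCrawl('staging.example.com', 'example.com',
    ['staging.example.com']));                                // bool(false)
```

Because the guard in CrawlRequestFulfilled keys on `instanceof CrawlSubdomains`, only a profile in that class hierarchy gets subdomain links queued, no matter what your own `shouldCrawl()` returns; that interaction is what I think deserves a mention in the documentation.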