spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Crawling larger sites

spekulatius opened this issue

Hey Spatie Team!

Awesome library! I was wondering how one would use it to crawl a larger site in chunks?

The sites in question have over one million pages each. Due to server load, time, and memory concerns I don't want to crawl a site all at once, but rather in chunks of, for example, 1,000 pages each.

I've written a queue helper based on the ArrayCrawlQueue that stores the progress.
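
Roughly, the helper looks something like this. It's a simplified sketch: PersistableArrayCrawlQueue, saveTo() and loadFrom() are just illustrative names, it assumes the queue state serializes cleanly, and the namespace follows recent releases (older versions ship the class as Spatie\Crawler\CrawlQueue\ArrayCrawlQueue).

```php
<?php

use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

class PersistableArrayCrawlQueue extends ArrayCrawlQueue
{
    // Persist the full queue state (pending and processed URLs) to disk.
    public function saveTo(string $path): void
    {
        file_put_contents($path, serialize($this));
    }

    // Restore a previously saved queue, or start with an empty one.
    public static function loadFrom(string $path): self
    {
        if (! file_exists($path)) {
            return new self();
        }

        return unserialize(file_get_contents($path));
    }
}
```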

But due to https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L43, no further URLs are added to the queue once the MaximumCrawlCount is reached.

As far as I can see, maximumDepth only considers URL segments.

Have I missed anything explaining how to crawl larger sites?

Cheers,
Peter

What do you mean by server load and memory issues? Is the target server struggling, or is the crawler taking too much memory?

I'm referring to the target server here: if I send numerous concurrent requests to a backend-heavy application, this might have unintended side effects. So I aim to crawl slowly and politely.

Support for increased politeness was added not too long ago: setDelayBetweenRequests. It works best with a concurrency of 1. IMHO, crawling N requests at max speed and then waiting is no more polite than crawling 1 page every X seconds.
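
For reference, a minimal example of those settings; the URL and the delay value are just placeholders:

```php
<?php

use Spatie\Crawler\Crawler;

// One request at a time, waiting 500 ms between requests.
Crawler::create()
    ->setConcurrency(1)
    ->setDelayBetweenRequests(500) // milliseconds
    // ->setCrawlObserver(...) to actually collect the results
    ->startCrawling('https://example.com');
```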

Yes, but that doesn't solve the problem of a very long-running queue job on my end. By the way, my approach actually lets you follow your idea even more closely without choking your own server: crawl N requests at full speed, store the queue, end the process, and at a later time reload the queue and pass it to the crawler as a starting point again.
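
Sketched out, one pass of that cycle could look roughly like this, reusing the hypothetical PersistableArrayCrawlQueue from above. Paths, limits and the URL are placeholders, and the LinkAdder caveat from the opening post still applies to setMaximumCrawlCount:

```php
<?php

use Spatie\Crawler\Crawler;

// Pick up where the previous run stopped (or start fresh on the first run).
$queueFile = 'storage/crawler/example.com.queue';
$queue = PersistableArrayCrawlQueue::loadFrom($queueFile);

Crawler::create()
    ->setCrawlQueue($queue)
    ->setMaximumCrawlCount(1000) // caveat: LinkAdder stops adding new URLs once this total is reached
    ->setConcurrency(1)
    ->setDelayBetweenRequests(250)
    ->startCrawling('https://example.com');

// Persist whatever is still pending so a later process can continue from here.
$queue->saveTo($queueFile);
```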

So the real problem here is that you want to adapt the crawler to a job/worker system. Am I right?

Yeah @Redominus, I want to use it as part of a batched Laravel process.

If I, for example, want to crawl 100 websites but don't want to run 100 processes that all take up memory (and block other batched processes), I need to stop the PHP process in between. By storing only the queue data, I can reload it later and continue exactly where I left off. The change would allow me to crawl, let's say, 20 pages with a minimal delay, store the queue, and move on to the next website without keeping an idle worker waiting for a timeout.
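
To make that concrete, here is a rough sketch of rotating through several sites, a small slice at a time, again using the hypothetical helper from above. Newer releases of spatie/crawler have since added setCurrentCrawlLimit(), which limits only the current run and is aimed at exactly this kind of chunked crawling; on older versions setMaximumCrawlCount() is the closest option, with the limitation described in the opening post:

```php
<?php

use Spatie\Crawler\Crawler;

// Give each site a small slice of work, persist its queue, then move on.
$sites = ['https://example.com', 'https://example.org'];

foreach ($sites as $site) {
    $queueFile = 'storage/crawler/' . md5($site) . '.queue';
    $queue = PersistableArrayCrawlQueue::loadFrom($queueFile);

    Crawler::create()
        ->setCrawlQueue($queue)
        ->setCurrentCrawlLimit(20) // per-run limit (available in newer releases)
        ->setDelayBetweenRequests(100)
        ->startCrawling($site);

    $queue->saveTo($queueFile);
}
```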

I see how it could be done indirectly using queues, but I feel this isn't optimal: as soon as I want to, e.g., store queues differently, I need to write my own driver for logic + storage again. That way everyone ends up writing their own Redis+logic or DB+logic drivers, and we can't swap them out easily.

Closing this; we'll continue the discussion in #331.