spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Crawling larger sites

spekulatius opened this issue

Hey Spatie Team!

Awesome library! I was wondering how one would use it to crawl a larger site in chunks?

The sites in question have over one million pages each. Due to server load, time, and memory concerns I don't want to crawl a site all at once, but rather in chunks of, for example, 1,000 pages each.

I've written a queue helper based on the ArrayCrawlQueue that stores the progress.
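
Roughly, the helper looks something like this. It's a simplified sketch: PersistableArrayCrawlQueue, saveTo() and loadFrom() are just illustrative names, it assumes the queue state serializes cleanly, and the namespace follows recent releases (older versions ship the class as Spatie\Crawler\CrawlQueue\ArrayCrawlQueue).

```php
<?php

use Spatie\Crawler\CrawlQueues\ArrayCrawlQueue;

class PersistableArrayCrawlQueue extends ArrayCrawlQueue
{
    // Persist the full queue state (pending and processed URLs) to disk.
    public function saveTo(string $path): void
    {
        file_put_contents($path, serialize($this));
    }

    // Restore a previously saved queue, or start with an empty one.
    public static function loadFrom(string $path): self
    {
        if (! file_exists($path)) {
            return new self();
        }

        return unserialize(file_get_contents($path));
    }
}
```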

But due to https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L43, no further URLs are added to the queue once the MaximumCrawlCount is reached.

As far as I can see, maximumDepth only considers URL segments.

Have I missed anything explaining how to crawl larger sites?

Cheers,
Peter

What do you mean by server load and memory issues? Is the target server struggling, or is the crawler taking too much memory?

I'm referring to the target server here: if I send numerous concurrent requests to a backend-heavy application, this might have unintended side effects. So I aim to crawl slowly and politely.

Support for increased politeness was added not too long ago: setDelayBetweenRequests. It works best with a concurrency of 1. IMHO, crawling N requests at max speed and then waiting is no more polite than crawling 1 page every X seconds.
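
For reference, a minimal example of those settings; the URL and the delay value are just placeholders:

```php
<?php

use Spatie\Crawler\Crawler;

// One request at a time, waiting 500 ms between requests.
Crawler::create()
    ->setConcurrency(1)
    ->setDelayBetweenRequests(500) // milliseconds
    // ->setCrawlObserver(...) to actually collect the results
    ->startCrawling('https://example.com');
```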

Yes, but that doesn't solve the problem of a very long-running queue job on my end. By the way, my approach actually lets you follow your idea even more closely without choking your own server: crawl N requests at full speed, store the queue, end the process, and at a later time reload the queue and pass it to the crawler as a starting point again.
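
Sketched out, one pass of that cycle could look roughly like this, reusing the hypothetical PersistableArrayCrawlQueue from above. Paths, limits and the URL are placeholders, and the LinkAdder caveat from the opening post still applies to setMaximumCrawlCount:

```php
<?php

use Spatie\Crawler\Crawler;

// Pick up where the previous run stopped (or start fresh on the first run).
$queueFile = 'storage/crawler/example.com.queue';
$queue = PersistableArrayCrawlQueue::loadFrom($queueFile);

Crawler::create()
    ->setCrawlQueue($queue)
    ->setMaximumCrawlCount(1000) // caveat: LinkAdder stops adding new URLs once this total is reached
    ->setConcurrency(1)
    ->setDelayBetweenRequests(250)
    ->startCrawling('https://example.com');

// Persist whatever is still pending so a later process can continue from here.
$queue->saveTo($queueFile);
```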

So the real problem here is that you want to adapt the crawler to a job/worker system. Am I right?

Yeah @Redominus, I want to use it as part of a batched Laravel process.

If I, for example, want to crawl 100 websites but don't want to run 100 processes that all take up memory (and block other batched processes), I need to stop the PHP process in between. By storing only the queue data, I can reload it later and continue exactly where I left off. The change would allow me to crawl, let's say, 20 pages with a minimal delay, store the queue, and move on to the next website without keeping an idle worker waiting for a timeout.
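
To make that concrete, here is a rough sketch of rotating through several sites, a small slice at a time, again using the hypothetical helper from above. Newer releases of spatie/crawler have since added setCurrentCrawlLimit(), which limits only the current run and is aimed at exactly this kind of chunked crawling; on older versions setMaximumCrawlCount() is the closest option, with the limitation described in the opening post:

```php
<?php

use Spatie\Crawler\Crawler;

// Give each site a small slice of work, persist its queue, then move on.
$sites = ['https://example.com', 'https://example.org'];

foreach ($sites as $site) {
    $queueFile = 'storage/crawler/' . md5($site) . '.queue';
    $queue = PersistableArrayCrawlQueue::loadFrom($queueFile);

    Crawler::create()
        ->setCrawlQueue($queue)
        ->setCurrentCrawlLimit(20) // per-run limit (available in newer releases)
        ->setDelayBetweenRequests(100)
        ->startCrawling($site);

    $queue->saveTo($queueFile);
}
```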

I see how it could be done indirectly using queues, but I feel this isn't optimal: as soon as I want to, e.g., store queues differently, I need to write my own driver for logic + storage again. That way everyone ends up writing their own Redis+logic or DB+logic drivers, and we can't swap them out easily.

Closing this; we'll continue the discussion in #331.