spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Best way to limit similar URLs

aaronbauman opened this issue · comments

Use case:
I'm using a crawler to build a visual regression test battery, and I want to make it efficient.

So, I want to tell the crawler to limit similar URLs.
-- I want to crawl all top-level URLs
-- For each sub-directory, I only want 3 sub-pages
---- For example, I want to collect /about, /contact, /jobs, /news, and /blog
---- But given a set of job listings /jobs/1, /jobs/2, /jobs/3, /jobs/4, /jobs/5, /jobs/6 - I only want the first 3

Not sure where to start with this - would you suggest a crawl profile, a crawl queue, or something else?
Thanks

Have you tried using the maximum crawl depth config?

Thanks for the follow up.
I did try max crawl depth, but I actually do not want to limit depth, just breadth for any particular subdirectory.

The solution I came up with is a relatively thin implementation of the Spatie\Crawler\CrawlQueue\CrawlQueue interface. Here's the code in case it helps someone else:
https://gist.github.com/aaronbauman/863c781f48572e644ca6b26d451653a6
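
For anyone who just wants the core idea without reading the gist, the per-directory counting can be sketched as a small standalone class. This is a minimal sketch, not the author's actual queue: the `BreadthLimiter` class name and the `$limit` parameter are illustrative, and it only looks at URL paths.

```php
<?php

// Sketch: allow every top-level page, but at most $limit pages
// per sub-directory (e.g. /jobs/1 .. /jobs/3 pass, /jobs/4 is skipped).
class BreadthLimiter
{
    /** @var array<string, int> count of URLs seen per parent directory */
    private array $counts = [];

    public function __construct(private int $limit = 3)
    {
    }

    public function shouldCrawl(string $path): bool
    {
        // dirname('/about') is '/', so top-level pages are always allowed.
        $dir = rtrim(dirname($path), '/');
        if ($dir === '' || $dir === '.') {
            return true;
        }

        // Count this URL against its parent directory's budget.
        $this->counts[$dir] = ($this->counts[$dir] ?? 0) + 1;

        return $this->counts[$dir] <= $this->limit;
    }
}
```

To wire this into the crawler, you could delegate to it from a subclass of the package's abstract `CrawlProfile`, returning `$this->limiter->shouldCrawl($url->getPath())` from `shouldCrawl(UriInterface $url)`. Note a profile decides per-URL before crawling, whereas a custom CrawlQueue (the approach in the gist above) can also deduplicate and reorder what's already been queued.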