Best way to limit similar URLs
aaronbauman opened this issue
Use case:
I'm using a crawler to build a visual regression test battery, and I want to make it efficient.
So, I want to tell the crawler to limit similar URLs.
-- I want to crawl all top-level URLs
-- For each sub-directory, I only want 3 sub-pages
---- For example, I want to collect /about, /contact, /jobs, /news, and /blog
---- But given a set of job listings /jobs/1, /jobs/2, /jobs/3, /jobs/4, /jobs/5, /jobs/6, I only want the first 3
I'm not sure where to start with this. Would you suggest a crawl profile, a crawl queue, or something else?
Thanks
Have you tried using the maximum crawl depth config?
Thanks for the follow up.
I did try max crawl depth, but I actually don't want to limit depth, only the breadth within any particular subdirectory.
The solution I came up with is a relatively thin implementation of Spatie\Crawler\CrawlQueue\CrawlQueue. Here's the code in case it helps someone else:
https://gist.github.com/aaronbauman/863c781f48572e644ca6b26d451653a6
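The gist aside, the core idea can be sketched as a small helper that a custom CrawlQueue could consult before enqueueing a URL. Everything below (the class and method names) is illustrative, not part of spatie/crawler's API: it just counts how many URLs have been seen under each parent directory and rejects the rest, leaving depth unlimited.

```php
<?php

// Hypothetical helper: caps how many URLs are accepted under each
// parent directory, while always allowing top-level pages through.
// Not part of spatie/crawler; a custom CrawlQueue could delegate to it.
final class DirectoryBreadthLimiter
{
    /** @var array<string, int> count of URLs seen per parent directory */
    private array $counts = [];

    public function __construct(private int $maxPerDirectory = 3)
    {
    }

    public function shouldCrawl(string $url): bool
    {
        $path = parse_url($url, PHP_URL_PATH) ?? '/';
        $dir  = dirname($path);

        // Top-level pages ("/about", "/jobs") have "/" as their parent:
        // always crawl those.
        if ($dir === '/' || $dir === '\\' || $dir === '.') {
            return true;
        }

        // Count this URL against its directory's budget. (Note: calling
        // this twice for the same URL counts it twice; the real queue
        // should de-duplicate before asking.)
        $this->counts[$dir] = ($this->counts[$dir] ?? 0) + 1;

        return $this->counts[$dir] <= $this->maxPerDirectory;
    }
}
```

A custom CrawlQueue's add() method could call shouldCrawl() and silently drop any URL that exceeds its directory's budget, so /jobs/1 through /jobs/3 get queued while /jobs/4 onward are skipped.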