spatie / crawler

An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Best way to limit similar URLs

aaronbauman opened this issue · comments

Use case:
I'm using a crawler to build a visual regression test battery, and I want to make it efficient.

So, I want to tell the crawler to limit similar URLs.
-- I want to crawl all top-level URLs
-- For each sub-directory, I only want 3 sub-pages
---- For example, I want to collect /about, /contact, /jobs, /news, and /blog
---- But given a set of job listings /jobs/1, /jobs/2, /jobs/3, /jobs/4, /jobs/5, /jobs/6 - I only want the first 3

Not sure where to start with this - would you suggest a crawl profile, a crawl queue, or something else?
Thanks

Have you tried using the maximum crawl depth config?

Thanks for the follow up.
I did try max crawl depth, but I actually do not want to limit depth, just breadth for any particular subdirectory.

The solution I came up with is a relatively thin implementation of the Spatie\Crawler\CrawlQueue\CrawlQueue interface. Here's the code in case it helps someone else:
https://gist.github.com/aaronbauman/863c781f48572e644ca6b26d451653a6
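
For anyone who just wants the core idea without reading the gist, the per-directory counting can be sketched as a small standalone class. This is a minimal sketch, not the author's actual queue: the `BreadthLimiter` class name and the `$limit` parameter are illustrative, and it only looks at URL paths.

```php
<?php

// Sketch: allow every top-level page, but at most $limit pages
// per sub-directory (e.g. /jobs/1 .. /jobs/3 pass, /jobs/4 is skipped).
class BreadthLimiter
{
    /** @var array<string, int> count of URLs seen per parent directory */
    private array $counts = [];

    public function __construct(private int $limit = 3)
    {
    }

    public function shouldCrawl(string $path): bool
    {
        // dirname('/about') is '/', so top-level pages are always allowed.
        $dir = rtrim(dirname($path), '/');
        if ($dir === '' || $dir === '.') {
            return true;
        }

        // Count this URL against its parent directory's budget.
        $this->counts[$dir] = ($this->counts[$dir] ?? 0) + 1;

        return $this->counts[$dir] <= $this->limit;
    }
}
```

To wire this into the crawler, you could delegate to it from a subclass of the package's abstract `CrawlProfile`, returning `$this->limiter->shouldCrawl($url->getPath())` from `shouldCrawl(UriInterface $url)`. Note a profile decides per-URL before crawling, whereas a custom CrawlQueue (the approach in the gist above) can also deduplicate and reorder what's already been queued.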