vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or plain HTTP requests, and lets you scrape and interact with JavaScript-rendered websites

How to limit the search depth level?

killernova opened this issue

Like other scraping frameworks, e.g. Colly in Go:

c := colly.NewCollector(
	// MaxDepth is 1, so only the links on the scraped page
	// are visited, and no further links are followed
	colly.MaxDepth(1),
)

Hello @killernova!

By default, Kimurai has no idea how to crawl a particular website: there is no option for auto-crawling (following all the links on a site automatically without writing a selector first). This approach is quite similar to the Scrapy framework.
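For example, a minimal spider sketch (the name and URLs below are placeholders) follows only the links it explicitly selects:

require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # Follow only links matched by an explicit selector:
    response.css("a").each do |a|
      next unless a[:href]
      request_to :parse, url: absolute_url(a[:href], base: url)
    end
  end
end

ExampleSpider.crawl!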

I can recommend checking out some other Ruby alternatives here: https://github.com/lorien/awesome-web-scraping/blob/master/ruby.md#web-scraping-frameworks

Thanks. But it seems that Scrapy also supports limiting the maximum depth, via a middleware:

DepthMiddleware
class scrapy.spidermiddlewares.depth.DepthMiddleware
DepthMiddleware is used for tracking the depth of each Request inside the site being scraped. It works by setting request.meta['depth'] = 0 whenever there is no value previously set (usually just the first Request) and incrementing it by 1 otherwise.
It can be used to limit the maximum depth to scrape, control Request priority based on their depth, and things like that.
The DepthMiddleware can be configured through the following settings (see the settings documentation for more info):
DEPTH_LIMIT - The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_STATS_VERBOSE - Whether to collect the number of requests for each depth.
DEPTH_PRIORITY - Whether to prioritize the requests based on their depth.

You're right, but there is no similar feature in Kimurai at the moment.
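One possible workaround, sketched below (MAX_DEPTH is the spider's own constant, not a Kimurai setting), is to track the depth yourself by passing a counter through the data: hash that request_to accepts, much like Scrapy's request.meta['depth']:

require 'kimurai'

class DepthLimitedSpider < Kimurai::Base
  @name = "depth_limited_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  # Same idea as colly.MaxDepth(1) or Scrapy's DEPTH_LIMIT = 1:
  # the start page is depth 0, its links are depth 1, nothing deeper.
  MAX_DEPTH = 1

  def parse(response, url:, data: {})
    depth = data.fetch(:depth, 0)
    # Once the limit is reached, parse the page but follow no links:
    return if depth >= MAX_DEPTH

    response.css("a").each do |a|
      next unless a[:href]
      request_to :parse, url: absolute_url(a[:href], base: url),
        data: { depth: depth + 1 }
    end
  end
end

Here the start URLs run at depth 0, links found on them are fetched at depth 1, and no further links are followed.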