vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or plain HTTP requests, and lets you scrape and interact with JavaScript-rendered websites

How to limit the search depth level?

killernova opened this issue

Like other scraping frameworks, e.g. Colly in Go:

c := colly.NewCollector(
	// MaxDepth is 1, so only the links on the scraped page
	// are visited, and no further links are followed
	colly.MaxDepth(1),
)

Hello @killernova!

By default, Kimurai has no idea how to crawl a particular website: there is no option for auto-crawling (following all the links on a site automatically without writing a selector first). This approach is quite similar to the Scrapy framework.
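For example, a minimal spider sketch (the name and URLs below are placeholders) follows only the links it explicitly selects:

require 'kimurai'

class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  def parse(response, url:, data: {})
    # Follow only links matched by an explicit selector:
    response.css("a").each do |a|
      next unless a[:href]
      request_to :parse, url: absolute_url(a[:href], base: url)
    end
  end
end

ExampleSpider.crawl!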

I can recommend checking out some other Ruby alternatives here: https://github.com/lorien/awesome-web-scraping/blob/master/ruby.md#web-scraping-frameworks

Thanks. But it seems that Scrapy also supports limiting the maximum depth, via a middleware:

DepthMiddleware
class scrapy.spidermiddlewares.depth.DepthMiddleware
DepthMiddleware is used for tracking the depth of each Request inside the site being scraped. It works by setting request.meta['depth'] = 0 whenever there is no value previously set (usually just the first Request) and incrementing it by 1 otherwise.
It can be used to limit the maximum depth to scrape, control Request priority based on their depth, and things like that.
The DepthMiddleware can be configured through the following settings (see the settings documentation for more info):
DEPTH_LIMIT - The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.
DEPTH_STATS_VERBOSE - Whether to collect the number of requests for each depth.
DEPTH_PRIORITY - Whether to prioritize the requests based on their depth.

You're right, but there is no similar feature in Kimurai at the moment.
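One possible workaround, sketched below (MAX_DEPTH is the spider's own constant, not a Kimurai setting), is to track the depth yourself by passing a counter through the data: hash that request_to accepts, much like Scrapy's request.meta['depth']:

require 'kimurai'

class DepthLimitedSpider < Kimurai::Base
  @name = "depth_limited_spider"
  @engine = :mechanize
  @start_urls = ["https://example.com/"]

  # Same idea as colly.MaxDepth(1) or Scrapy's DEPTH_LIMIT = 1:
  # the start page is depth 0, its links are depth 1, nothing deeper.
  MAX_DEPTH = 1

  def parse(response, url:, data: {})
    depth = data.fetch(:depth, 0)
    # Once the limit is reached, parse the page but follow no links:
    return if depth >= MAX_DEPTH

    response.css("a").each do |a|
      next unless a[:href]
      request_to :parse, url: absolute_url(a[:href], base: url),
        data: { depth: depth + 1 }
    end
  end
end

Here the start URLs run at depth 0, links found on them are fetched at depth 1, and no further links are followed.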