vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites.

Pass response callback as block

Tails opened this issue

commented

The one thing that has bothered me about Scrapy is that callbacks can't be given inline to show the visual hierarchy of the pages being scraped. Ruby, however, has blocks. Could we do something like this?

request url: 'http://example.com' do |response|
  # xpath returns every matching node; at_xpath would return only the first one
  response.xpath("//a[@class='next_page']").each do |next_link|
    request url: next_link[:href] do |response2|
      # etc.
    end
  end
end

@Tails Wow, this is an interesting design! I think it's possible to extend the request_to method to optionally receive a block, just like in your example. I'll try to implement it soon and tell you how it goes.

commented

I actually managed to get this working yesterday, and it's not too difficult. There is only one major drawback: the inline callbacks cannot be called from the CLI (without some more complex promise-based lookup).

Here's how it works (in ApplicationSpider):

# use this class callback to bootstrap from start_urls
def self.request_start(&handler)
  define_method(:parse, &handler)
end

# parse page using inline block as callback
def request_in(handler_id, delay = nil, **opts, &handler)
  handler_name = :"parse_#{handler_id}_page"
  # define the handler only once; a later block passed with the same handler_id is ignored
  self.class.send(:define_method, handler_name, &handler) unless respond_to?(handler_name)
  request_to handler_name, delay, **opts
end

# request multiple urls inline
def request_all(handler_id, delay = nil, urls: [], **opts, &handler)
  urls.each do |url|
    request_in(handler_id, delay, url: Linkable.parse(url), **opts, &handler)
  end
end
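
To see what request_in does in isolation, here is a minimal sketch that can run without Kimurai. FakeSpider and its request_to are hypothetical stand-ins (the stub calls the handler immediately instead of fetching a page), so this only illustrates the define_method mechanism, not the framework's real request flow.

```ruby
# Hypothetical stand-in for a spider; nothing here touches the network.
class FakeSpider
  # Stub, NOT Kimurai's request_to: it calls the handler right away.
  def request_to(handler, delay = nil, url:, data: {})
    public_send(handler, "<response for #{url}>", url: url, data: data)
  end

  # Same idea as request_in above: turn the block into a named instance method.
  def request_in(handler_id, delay = nil, **opts, &handler)
    handler_name = :"parse_#{handler_id}_page"
    self.class.send(:define_method, handler_name, &handler) unless respond_to?(handler_name)
    request_to handler_name, delay, **opts
  end
end

spider = FakeSpider.new
results = []

# Both calls use the handler id :product but pass different blocks.
spider.request_in(:product, url: "http://example.com/1") { |_response, **opts| results << [:first, opts[:url]] }
spider.request_in(:product, url: "http://example.com/2") { |_response, **opts| results << [:second, opts[:url]] }

p results
```

Because the method is only defined when it does not exist yet, the first block registered under a handler id wins: both entries in results are tagged :first. Handler ids therefore need to be unique per distinct callback.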

And this is what you can do (defined at class level in your spider):

class YourSpider < ApplicationSpider
  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      category_name = product_list.css('h1.main-title').text
      product_list.css(".css-selector2").each do |link|
        request_in :product, url: link, data: {category_name: category_name} do |product, **opts3|
          item = parse_entity(:product)
          save_to "results.json", item, format: :pretty_json
        end
      end
    end
  end
end
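
The CLI drawback mentioned earlier presumably comes from when these handlers start to exist: a nested handler is only defined once its enclosing block has actually run. The sketch below illustrates that timing; LazySpider and its request_to are hypothetical stubs rather than Kimurai code.

```ruby
# Hypothetical stand-in for a spider; request_to is a stub, not Kimurai's.
class LazySpider
  def request_to(handler, delay = nil, url:, data: {})
    public_send(handler, "<page at #{url}>", url: url, data: data)
  end

  def request_in(handler_id, delay = nil, **opts, &handler)
    handler_name = :"parse_#{handler_id}_page"
    self.class.send(:define_method, handler_name, &handler) unless respond_to?(handler_name)
    request_to handler_name, delay, **opts
  end

  def self.request_start(&handler)
    define_method(:parse, &handler)
  end

  # The outer block becomes #parse when the class is loaded.
  request_start do |_response, **_opts|
    # The inner block only becomes #parse_product_page when #parse runs.
    request_in :product, url: "http://example.com/product" do |_product, **_opts2|
      # innermost callback, intentionally empty
    end
  end
end

spider = LazySpider.new
before = spider.respond_to?(:parse_product_page) # false: nothing has run yet
spider.parse("<start page>", url: "http://example.com/", data: {})
after = spider.respond_to?(:parse_product_page)  # true: defined as a side effect of #parse

p [before, after]
```

Anything that tries to invoke parse_product_page by name before the start handler has run will not find it, which is why some extra lookup step would be needed to reach inline callbacks directly.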

@Tails
While it is indeed an interesting design, I personally prefer Scrapy's flat style, because it keeps the code clean without deep nesting. But that's just my opinion.

Another question: what about the 'recursion' pattern, which is common with pagination? Consider this code:

def parse_category(response, url:, data: {})
  response.xpath("//products/path").each do |product|
    request_to :parse_product, url: product[:href]
  end

  if next_page = response.at_xpath("//next_pagination_page/path")
    request_to :parse_category, url: next_page[:href]
  end
end
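
As a rough illustration of how this flat recursion terminates, here is a self-contained simulation. The PAGES hash, FlatSpider, and the stubbed request_to are all hypothetical: the stub looks pages up in memory and calls the handler directly instead of making HTTP requests.

```ruby
# In-memory "site": each category page lists product URLs and an optional next page.
PAGES = {
  "/category?page=1" => { products: ["/p/1", "/p/2"], next_page: "/category?page=2" },
  "/category?page=2" => { products: ["/p/3"], next_page: nil }
}.freeze

class FlatSpider
  attr_reader :items

  def initialize
    @items = []
  end

  # Stub, NOT Kimurai's request_to: dispatches to the named handler synchronously.
  def request_to(handler, url:, data: {})
    public_send(handler, PAGES[url], url: url, data: data)
  end

  def parse_category(response, url:, data: {})
    response[:products].each do |product_url|
      request_to :parse_product, url: product_url
    end

    # The handler requests itself until there is no next page.
    if (next_page = response[:next_page])
      request_to :parse_category, url: next_page
    end
  end

  def parse_product(_response, url:, data: {})
    @items << url
  end
end

spider = FlatSpider.new
spider.request_to(:parse_category, url: "/category?page=1")
p spider.items
```

Because parse_category is an ordinary named method, it can refer to itself by symbol. An inline block has no name of its own, so a block-based design needs some handler id to make the same self-reference possible.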

Clean and simple. But I think it's possible to implement this feature with the block design as well (see the example below).

Also, I think request_all and request_in should be merged into one method, and request_start would be better named parse, to follow the current convention of a default def parse start method. Here is my draft of how it could look:

class YourSpider < ApplicationSpider
  @name = "your_spider"
  @start_urls = ["http://example.com/"]

  parse do |response, url|
    categories_urls = response.xpath("//categories/path").map { |a| a[:href] } # Array of urls
    request :parse_category, to: categories_urls do |response, url, data|
      products_urls = response.xpath("//products/path").map { |a| a[:href] }
      request :parse_product, to: products_urls do |response, url|
        item = {}
        item[:title] = response.xpath("//path/to/product/title").text.strip
        save_to "items.json", item, format: :pretty_json
      end

      if next_page = response.at_xpath("//path/to/category/next/pagination/page")
        request :parse_category, to: next_page[:href] # String, when there is no data to pass
        # or
        request :parse_category, to: { url: next_page[:href], data: { some: :value } } # Hash, when data needs to be passed as well
      end
    end
  end
end
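
One possible way to implement the merged request method from this draft is sketched below. It is an assumption-laden illustration, not Kimurai's API: BlockSpider, its stubbed request_to, and the choice to always pass three positional arguments (response, url, data) to handlers are all hypothetical.

```ruby
class BlockSpider
  attr_reader :log

  def initialize
    @log = []
  end

  # Stub, NOT Kimurai's request_to: records the call and invokes the handler directly.
  def request_to(handler, url:, data: {})
    @log << [handler, url, data]
    public_send(handler, "<page at #{url}>", url, data)
  end

  # `to:` may be a String, a Hash ({ url:, data: }), or an Array of either.
  def request(handler, to:, &block)
    # Register the block under the handler name the first time it is seen.
    self.class.send(:define_method, handler, &block) if block && !respond_to?(handler)
    raise ArgumentError, "no handler #{handler} and no block given" unless respond_to?(handler)

    targets = to.is_a?(Array) ? to : [to]
    targets.each do |target|
      opts = target.is_a?(Hash) ? target : { url: target }
      request_to handler, **opts
    end
  end
end

spider = BlockSpider.new
spider.request :parse_category, to: ["http://example.com/c1"] do |_response, url, data|
  # `self` is the spider here, so the handler can re-request itself by name without a block.
  request :parse_category, to: { url: "#{url}?page=2", data: { page: 2 } } if data.empty?
end

p spider.log
```

A block converted with define_method checks its argument count strictly, like a method, so in this sketch every block has to declare all three parameters. Supporting shorter forms such as |response, url| would need extra handling, for example inspecting the handler's arity before calling it.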