vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby. It works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows you to scrape and interact with JavaScript-rendered websites.

Wrap the Nokogiri response to reduce boilerplate

Tails opened this issue · comments

commented

Currently, a Nokogiri object is passed as an argument to the callback. This results in boilerplate, since the same operations have to be defined over and over: extracting text, formatting results, etc.

If a wrapper object were supplied instead of the current response, we could decorate it with some nice utility functions and let it carry the URL. Wrapping the object would allow the definition of custom selectors besides css and xpath, such as regex, or a composite of any of these. It would also remove the need to specify whether a single element or multiple elements should be extracted, similar to extract() and extract_first() on Scrapy's scrapy.Selector.

# in your scraper class
def parse_product_list_page(product_list, url:, data: {})
  product_ids = product_list.regex(/"id":([0-9]+),/)
end
# page.rb
require 'forwardable'

class Page
  extend Forwardable

  def initialize(response, browser)
    @response = response
    @browser = browser
  end

  # get the current HTML page (fresh)
  def refresh
    @response = @browser.current_response
    self
  end

  #
  # extract methods
  #

  # general purpose entrypoint
  def extract(expression, multi: true, async: false)
    if async
      extract_on_ready(expression, multi: multi)
    elsif multi
      extract_all(expression)
    else
      extract_single(expression)
    end
  end

  # extract first element
  def extract_single(expression, **opts)
    extract_all(expression, **opts).first
  end

  # TODO: wrap results so we can apply a new expression on the subset
  def extract_all(expression, wrap: false)
    query = SelectorExpression.instance(expression)
    # self.send calls the delegated xpath() or css() method, based on the
    # type of the selector wrapper ("expression"), which defaults to css
    Array(self.send(query.type, query.to_s))
  end

  def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
    retries.times do
      result = extract(expression, multi: multi, async: false)
      case result
      when Nokogiri::XML::Element
        return result
      when Nokogiri::XML::NodeSet, Array
        return result if !result.empty?
      end
      sleep wait
      refresh
    end
    default
  end

  #
  # Nokogiri wrapping
  #

  # delegate functions to the response object so this Page object responds to all classic parsing and selection functions
  def_delegators :@response, :xpath, :css, :text, :children

  # scan must receive the Regexp itself; calling to_s on a Regexp would
  # turn it into a literal string pattern
  def regex(pattern)
    @response.text.scan(pattern)
  end

end
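
(SelectorExpression above is left undefined; here is a rough sketch of what such a wrapper could look like. The class name, the :css default, and the instance factory are all assumptions of this proposal, not existing Kimurai code.)

# Hypothetical sketch of the SelectorExpression wrapper used by extract_all.
# It tags an expression with its selector type (:css by default) so that
# Page#extract_all can dispatch to the matching Nokogiri method.
class SelectorExpression
  attr_reader :type

  def initialize(expression, type: :css)
    @expression = expression
    @type = type
  end

  # pass already-wrapped expressions through unchanged
  def self.instance(expression)
    expression.is_a?(self) ? expression : new(expression)
  end

  def to_s
    @expression.to_s
  end
end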

Beyond this, we could also consider wrapping the results of xpath() and css() calls, so the same utility functions would be available when doing a subquery:

page.xpath('//div').css('.items').regex(/my-regex/)
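
A rough sketch of that wrapping, assuming a small hypothetical Fragment class that re-wraps each NodeSet (none of these names exist in Kimurai today):

require 'forwardable'

# Hypothetical Fragment wrapper: each css/xpath call returns another
# Fragment, so helpers like regex remain available on subquery results.
class Fragment
  extend Forwardable
  def_delegators :@node_set, :text, :first, :each, :empty?

  def initialize(node_set)
    @node_set = node_set
  end

  def xpath(expression)
    self.class.new(@node_set.xpath(expression))
  end

  def css(expression)
    self.class.new(@node_set.css(expression))
  end

  def regex(pattern)
    @node_set.text.scan(pattern)
  end
end

Page#xpath and Page#css would then wrap their results instead of delegating directly, e.g. Fragment.new(@response.css(expression)).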

Here are my thoughts:

It would also remove the need to specify whether a single element or multiple elements should be extracted, similar to extract() and extract_first() on Scrapy's scrapy.Selector.

We don't need extract() or extract_first() with Nokogiri, because it has at_xpath and at_css in addition to the xpath and css methods.
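
For example, at_css returns the first matching node (or nil), while css returns the full NodeSet:

require 'nokogiri'

doc = Nokogiri::HTML('<div><p>one</p><p>two</p></div>')
doc.css('p').map(&:text) # => ["one", "two"] (all matches, like extract())
doc.at_css('p').text     # => "one"          (first match, like extract_first())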

or a composite of any of these

With Nokogiri you can chain selectors as well.
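
For instance, reusing the doc from the snippet above (css and xpath return a NodeSet, which itself responds to css and xpath, so queries compose):

doc.xpath('//div').css('p').map(&:text) # => ["one", "two"]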

I like the idea of auto-waiting (extract_on_ready), but I still think it's better to give the user the freedom to decide when to update the response using browser.current_response. Capybara has nice methods like has_css? (has_css?("selector", wait: 10)) with a wait option (how many seconds to wait for the selector before returning false if it still hasn't been found). There are similar has_xpath? and has_text? methods as well.
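
For instance, a parse method could wait for a selector and only then refresh the response (the .products selector here is made up for illustration; browser is the Capybara session Kimurai exposes):

# wait up to 10 seconds for the nodes to appear, then re-read the page
if browser.has_css?(".products", wait: 10)
  response = browser.current_response
  product_ids = response.css(".products .id").map(&:text)
end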