Wrap the Nokogiri response to reduce boilerplate
Tails opened this issue
Currently, a Nokogiri object is passed as an argument to a callback. This results in boilerplate, since common operations such as extracting text and formatting results have to be repeated in every callback.
If a wrapper object was supplied instead of the current response, we could decorate it with some nice utility functions and let it carry the URL. Wrapping the object would allow the definition of custom selectors besides css and xpath, such as regex, or a composite of any of these. It would also remove the need to specify whether a single element or multiple elements have to be extracted, similar to the extract() and extract_first() methods of Scrapy's scrapy.Selector.
# in your scraper class
def parse_product_list_page(product_list, url:, data: {})
  product_ids = product_list.regex(/"id":([0-9]+),/)
end
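The regex selector in the snippet above would essentially boil down to String#scan over the page text. For instance:

```ruby
# Running a capture-group regex over raw page text with String#scan:
# scan returns one array of captures per match.
text = 'products: [{"id":101},{"id":202}]'
text.scan(/"id":([0-9]+)/)  # => [["101"], ["202"]]
```
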
# Page.rb
require 'forwardable'

class Page
  extend Forwardable

  def initialize(response, browser)
    @response = response
    @browser = browser
  end

  # get the current HTML page (fresh)
  def refresh
    @response = @browser.current_response
    self
  end

  #
  # extract methods
  #

  # general-purpose entrypoint
  def extract(expression, multi: true, async: false)
    if async
      extract_on_ready(expression, multi: multi)
    elsif multi
      extract_all(expression)
    else
      extract_single(expression)
    end
  end

  # extract the first element
  def extract_single(expression, **opts)
    extract_all(expression, **opts).first
  end

  # TODO: wrap results so we can apply a new expression on the subset
  def extract_all(expression, wrap: false)
    query = SelectorExpression.instance(expression)
    # self.send dispatches to the delegated xpath() or css() method,
    # depending on the type of the selector wrapper ("expression"); the type defaults to css
    Array(self.send(query.type, query.to_s))
  end

  def extract_on_ready(expression, multi: true, retries: 3, wait: 1, default: nil)
    retries.times do
      result = extract(expression, multi: multi, async: false)
      case result
      when Nokogiri::XML::Element
        return result
      when Nokogiri::XML::NodeSet, Array
        return result unless result.empty?
      end
      sleep wait
      refresh
    end
    default
  end

  #
  # Nokogiri wrapping
  #

  # delegate these methods to the response object so this Page object
  # responds to all the classic parsing and selection methods
  def_delegators :@response, :xpath, :css, :text, :children

  def regex(selector)
    @response.text.scan(selector.to_s)
  end
end
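The Page class above assumes a SelectorExpression helper that normalizes a raw selector into a query type (:css or :xpath) plus its string form. A minimal hypothetical sketch, with illustrative names and a simple heuristic for detecting XPath:

```ruby
# Hypothetical sketch of the SelectorExpression helper used by Page#extract_all.
# It wraps a raw selector string and exposes the query type that Page dispatches on.
class SelectorExpression
  attr_reader :type

  def self.instance(expression)
    # pass through if the caller already built a SelectorExpression
    return expression if expression.is_a?(self)
    new(expression)
  end

  def initialize(expression, type: nil)
    @expression = expression.to_s
    # heuristic: selectors starting with "/" or "./" are treated as XPath, else CSS
    @type = type || (@expression.start_with?('/', './') ? :xpath : :css)
  end

  def to_s
    @expression
  end
end
```

A regex or composite selector type could be added the same way, returning :regex so Page dispatches to its own regex method instead of a delegated Nokogiri one.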
Beyond this, we could also consider wrapping the results of xpath() and css() calls, so the same utility functions are available when doing a subquery:
page.xpath('//').css('.items').regex(/my-regex/)
Here are my thoughts:
It would also remove the need to specify whether single or multiple elements have to be extracted, similar to Scrapy's extract() and extract_first() of scrapy.Selector.
We don't need extract() or extract_first() with Nokogiri, because it has at_xpath and at_css in addition to the xpath and css methods.
or a composite of any of these
With Nokogiri you can chain selectors as well.
I like the idea of auto-waiting (extract_on_ready), but I still think it's better to give the user the freedom to decide when to update the response using browser.current_response. Capybara has nice methods like has_css? (e.g. has_css?("selector", wait: 10)) with a wait option: the time in seconds to wait for the selector before returning false if it still cannot be found. There are similar has_xpath? and has_text? methods as well.