Pass response callback as block
Tails opened this issue
The one thing that has bothered me about Scrapy is that callbacks can't be given inline to show the visual hierarchy of the pages being scraped. Ruby, however, has blocks. Could we do something like this?
```ruby
request url: 'http://example.com' do |response|
  response.xpath("//a[@class='next_page']").each do |next_link|
    request url: next_link do |response2|
      # etc
    end
  end
end
```
@Tails Wow, this is an interesting design! I think it's possible to extend the `request_to` method to optionally receive a block, just as in your example. I'll try to implement it soon and let you know how it goes.
I actually managed to implement it yesterday, and it's not too difficult. There is only one major drawback: the inline callbacks cannot be invoked from the CLI (without some more complex promise-based lookup).
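To show the core mechanism in isolation, here is a minimal, framework-free sketch (the `Spider` class and the handler-name scheme here are illustrative, not the framework's actual code): `define_method` turns an inline block into a real named method. Until that registration happens, a CLI that dispatches callbacks by method name has nothing it can call, which is where the drawback comes from.

```ruby
# Framework-free sketch: registering an inline block as a named method.
class Spider
  # Hypothetical helper: derive a handler name from an id and define
  # the given block as a real instance method under that name.
  def request_in(handler_id, &handler)
    handler_name = :"parse_#{handler_id}_page"
    unless respond_to?(handler_name)
      self.class.send(:define_method, handler_name, &handler)
    end
    handler_name # the real code would pass this name on to request_to
  end
end

spider = Spider.new
name = spider.request_in(:product) { |response| "parsed #{response}" }
# The block is now addressable by name, e.g. from a CLI dispatcher:
spider.send(name, "page1") # => "parsed page1"
```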
Here's how it works (in ApplicationSpider):
```ruby
# use this class-level callback to bootstrap from start_urls
def self.request_start(&handler)
  define_method(:parse, &handler)
end

# parse a page using an inline block as the callback
def request_in(handler_id, delay = nil, **opts, &handler)
  handler_name = :"parse_#{handler_id}_page"
  self.class.send(:define_method, handler_name, &handler) unless methods.include?(handler_name)
  request_to handler_name, delay, **opts
end

# request multiple urls inline
def request_all(handler_id, delay = nil, urls: [], **opts, &handler)
  urls.each do |url|
    request_in(handler_id, delay, url: Linkable.parse(url), **opts, &handler)
  end
end
```
And this is what you can do (defined at class level in a spider):
```ruby
class YourSpider < ApplicationSpider
  # start from start_urls
  request_start do |category_list, **opts1|
    request_all :product_list, urls: category_list.css(".css-selector1") do |product_list, **opts2|
      category_name = product_list.css("h1.main-title").text

      product_list.css(".css-selector2").each do |link|
        request_in :product, url: link, data: { category_name: category_name } do |product, **opts3|
          item = parse_entity(:product)
          save_to "results.json", item, format: :pretty_json
        end
      end
    end
  end
end
```
@Tails
While it is indeed an interesting design, I personally prefer Scrapy's flat style, because it keeps the code clean and flat, without deep nesting. But that's just my opinion.
Another question: what about the 'recursion' pattern, which often comes up with pagination? Consider this code:
```ruby
def parse_category(response, url:, data: {})
  response.xpath("//products/path").each do |product|
    request_to :parse_product, url: product[:href]
  end

  if next_page = response.at_xpath("//next_pagination_page/path")
    request_to :parse_category, url: next_page[:href]
  end
end
```
Clean and simple. But I think it's possible to implement this pattern with the block design as well (see the example below).
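To make the flat recursion pattern concrete without the framework, here is a self-contained simulation (everything in it is an illustrative stand-in: `PAGES` fakes fetched responses, `MiniSpider` and its queue fake the scheduler). Pagination is just the category handler scheduling itself with the next page's url.

```ruby
# Illustrative stand-in for fetched pages: url => parsed "response".
PAGES = {
  "/cat?page=1" => { products: ["/p1", "/p2"], next: "/cat?page=2" },
  "/cat?page=2" => { products: ["/p3"],        next: nil }
}

class MiniSpider
  attr_reader :products

  def initialize
    @queue = []    # pending [handler, url] pairs; stands in for the scheduler
    @products = [] # collected product urls
  end

  def request_to(handler, url:)
    @queue << [handler, url]
  end

  def run(start_url)
    request_to(:parse_category, url: start_url)
    until @queue.empty?
      handler, url = @queue.shift
      send(handler, PAGES.fetch(url, {}), url: url)
    end
  end

  # Flat, named callbacks -- pagination recurses by scheduling itself.
  def parse_category(response, url:)
    response[:products].each { |u| request_to(:parse_product, url: u) }
    request_to(:parse_category, url: response[:next]) if response[:next]
  end

  def parse_product(_response, url:)
    @products << url
  end
end

spider = MiniSpider.new
spider.run("/cat?page=1")
p spider.products # => ["/p1", "/p2", "/p3"]
```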
Also, I think that `request_all` and `request_in` should be merged into one method, and `request_start` would be better named `parse`, to follow the current convention of the default `def parse` start method. Here is a draft of how it could look:
```ruby
class YourSpider < ApplicationSpider
  @name = "your_spider"
  @start_urls = ["http://example.com/"]

  parse do |response, url|
    categories_urls = response.xpath("//categories/path").map(&:href) # Array of urls

    request :parse_category, to: categories_urls do |response, url, data|
      products_urls = response.xpath("//products/path").map(&:href)

      request :parse_product, to: products_urls do |response, url|
        item = {}
        item[:title] = response.xpath("//path/to/product/title").text.strip
        save_to "items.json", item, format: :pretty_json
      end

      if next_page = response.at_xpath("//path/to/category/next/pagination/page")
        request :parse_category, to: next_page[:href] # String, when we don't need to pass data
        # or
        request :parse_category, to: { url: next_page[:href], data: { some: :value } } # Hash, when we need to pass data as well
      end
    end
  end
end
```
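A rough sketch of how the proposed unified `request` could normalize its `to:` argument (this is a proposal sketch, not existing framework code; `ApplicationSpiderDraft` and its `@scheduled` list are made up for illustration): a String is a bare url, a Hash carries a url plus data, an Array fans out into one request per element, and a given block is registered once under the handler name.

```ruby
class ApplicationSpiderDraft
  attr_reader :scheduled # illustrative: real code would dispatch, not collect

  def initialize
    @scheduled = []
  end

  def request(handler_name, to:, &handler)
    # Register the inline block as a named method, once.
    if handler && !respond_to?(handler_name)
      self.class.send(:define_method, handler_name, &handler)
    end

    # Normalize to: into a list of { url:, data: } targets.
    targets = to.is_a?(Hash) ? [to] : Array(to)
    targets.each do |target|
      url, data =
        target.is_a?(Hash) ? [target[:url], target[:data] || {}] : [target, {}]
      @scheduled << { handler: handler_name, url: url, data: data }
    end
  end
end

spider = ApplicationSpiderDraft.new
spider.request :parse_category, to: "http://example.com/cat"
spider.request :parse_category, to: { url: "/cat?page=2", data: { some: :value } }
spider.request :parse_product,  to: ["/p1", "/p2"]
p spider.scheduled.length # => 4
```

Treating the Hash form as a single target before the `Array()` coercion matters, because `Array(hash)` would otherwise split it into key-value pairs.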