vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby that works out of the box with headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and lets you scrape and interact with JavaScript-rendered websites


Issues with using skip_request_errors

doutatsu opened this issue

I am trying to use the provided configuration to skip 404 errors, but a RuntimeError is raised instead. Perhaps this is the intended behaviour, but I was expecting to get false, an empty object, or something similar. Let me know if I have misunderstood the functionality. Here is the configuration:

# frozen_string_literal: true

require 'kimurai'

module Spiders
  class Test < Kimurai::Base
    @name                = 'test_spider'
    @disable_images      = true
    @engine              = :mechanize
    @skip_request_errors = [
      { error: RuntimeError }
    ]

    def parse(response, url:, data: {})
    end
  end
end

If I then run it with Spiders::Test.parse!(:parse, url: 'https://google.com/asdfsdf'), I get back this error:

BrowserBuilder (mechanize): created browser instance
Browser: started get request to: https://google.com/asdfsdf
Browser: driver mechanize has been destroyed
Traceback (most recent call last):
        2: from (irb):2
        1: from (irb):2:in `rescue in irb_binding'
RuntimeError (Received the following error for a GET request to https://google.com/asdfsdf: '404 => Net::HTTPNotFound for https://google.com/asdfsdf -- unhandled response')

Am I doing something wrong, or is that the expected behaviour? I also tried this for the configuration:
{ error: RuntimeError, message: '404 => Net::HTTPNotFound' }

Hi @doutatsu,

skip_request_errors is a key of the @config hash, not a variable of its own. Here's the right way to write it:

module Spiders
  class Test < Kimurai::Base
    @name   = "test_spider"
    @engine = :mechanize
    @config = {
      disable_images: true,
      skip_request_errors: [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
    }

    def parse(response, url:, data: {})
    end
  end
end

Since the mechanize driver raises a RuntimeError whose message starts with a more generic prefix, you can also match on that prefix:

skip_request_errors: [{ error: RuntimeError, message: "Received the following error" }]

cf. https://github.com/jeroenvandijk/capybara-mechanize/blob/master/lib/capybara/mechanize/browser.rb#L143
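As a rough illustration of why both message values match, here is a minimal sketch. It assumes a rule applies when the raised error is an instance of the given class and its message contains the given text; that assumption fits the two examples above, since both strings appear in the error from the log, but it is not taken from Kimurai's source. The helper `skippable?` is hypothetical and not part of the framework.

```ruby
# Hypothetical helper, not part of Kimurai: decides whether an error matches
# any rule in a skip_request_errors-style list.
def skippable?(error, rules)
  rules.any? do |rule|
    # A rule may be a bare error class or a hash with :error and an optional :message.
    klass, message = rule.is_a?(Hash) ? rule.values_at(:error, :message) : [rule, nil]
    # Assumption: the message is compared as a substring of the error message.
    error.is_a?(klass) && (message.nil? || error.message.include?(message))
  end
end

# The error message reported in the log above.
full_message = "Received the following error for a GET request to https://google.com/asdfsdf: " \
               "'404 => Net::HTTPNotFound for https://google.com/asdfsdf -- unhandled response'"
error = RuntimeError.new(full_message)

specific = [{ error: RuntimeError, message: "404 => Net::HTTPNotFound" }]
generic  = [{ error: RuntimeError, message: "Received the following error" }]
other    = [{ error: RuntimeError, message: "500 => Net::HTTPInternalServerError" }]

puts skippable?(error, specific) # true
puts skippable?(error, generic)  # true
puts skippable?(error, other)    # false
```

Under that assumption, the generic prefix would match every request error raised with this message format, not only 404s, so the more specific text is the safer choice if only missing pages should be skipped.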

I hope this helps!

I thought that because you can do @disable_images = true, you could write any configuration option that way, by specifying it as an instance variable. I'll try it out and see if it works.

@doutatsu, please check the README (https://github.com/vifreefly/kimuraframework#all-available-config-options), which is the reference for all config options. As you can see there, you cannot do @disable_images = true; it will not work.
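To show why an option set as its own instance variable has no effect and raises no complaint, here is a minimal sketch in plain Ruby. `TinyBase` and both spider classes are hypothetical stand-ins, not Kimurai's actual code; the only assumption is that the framework reads its options from @config and nowhere else.

```ruby
# Hypothetical stand-in for a framework base class; not Kimurai's real code.
class TinyBase
  class << self
    # Only @config is ever read; any other class-level instance variable
    # is simply never looked at.
    def config
      @config || {}
    end
  end
end

class WrongSpider < TinyBase
  # Valid Ruby, but nothing reads this variable, so it has no effect.
  @skip_request_errors = [{ error: RuntimeError }]
end

class RightSpider < TinyBase
  @config = { skip_request_errors: [{ error: RuntimeError }] }
end

puts WrongSpider.config.key?(:skip_request_errors) # false
puts RightSpider.config.key?(:skip_request_errors) # true
```

Ruby lets you assign any instance variable on a class, so the mistake goes unnoticed until the option fails to take effect.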

Thanks, @vifreefly, I didn't realise that. I misunderstood how the configuration works. Sorry for the trouble.