vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows you to scrape and interact with JavaScript-rendered websites


Crawl in Sidekiq - Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver

Mirk32 opened this issue · comments

I try to run a crawler via a Sidekiq job on my DigitalOcean droplet, but it always fails with the error Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver". At the same time, I can run crawl! via the rails console and it works well, and it also works via Sidekiq on my local machine. I defined chromedriver_path in the Kimurai initializer: config.chromedriver_path = Rails.root.join('lib', 'webdrivers', 'chromedriver_83').to_s
Logs of the Sidekiq job, which I also started via the rails console with FekoCrawlWorker.perform_async:

Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.602Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: start
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Spider: started: feko_spider
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] DEBUG -- feko_spider: BrowserBuilder (selenium_chrome): created browser instance
Jun 29 19:43:26 aquacraft sidekiq[7201]: D, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] DEBUG -- feko_spider: BrowserBuilder (selenium_chrome): enabled native headless_mode
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Browser: started get request to: https://feko.com.ua/shop/category/kotly/gazovye-kotly331/page/1
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Info: visits: requests: 1, responses: 0
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29 19:43:26 WARN Selenium [DEPRECATION] :driver_path is deprecated. Use :service with an instance of Selenium::WebDriver::Service instead.
Jun 29 19:43:26 aquacraft sidekiq[7201]: I, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140]  INFO -- feko_spider: Browser: driver selenium_chrome has been destroyed
Jun 29 19:43:26 aquacraft sidekiq[7201]: F, [2020-06-29 19:43:26 +0000#7201] [C: 70059979631140] FATAL -- feko_spider: Spider: stopped: {:spider_name=>"feko_spider", :status=>:failed, :error=>"#<Selenium::WebDriver::Error::WebDriverError: not a file: \"./bin/chromedriver\">", :environment=>"development", :start_time=>2020-06-29 19:43:26 +0000, :stop_time=>2020-06-29 19:43:26 +0000, :running_time=>"0s", :visits=>{:requests=>1, :responses=>0}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.607Z 7201 TID-ou13yz8xx FekoCrawlWorker JID-7d134b4ee9407973d7803f0b INFO: fail: 0.006 sec
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: {"context":"Job raised exception","job":{"class":"FekoCrawlWorker","args":[],"retry":false,"queue":"default","backtrace":true,"jid":"7d134b4ee9407973d7803f0b","created_at":1593459806.6006012,"enqueued_at":1593459806.6006787},"jobstr":"{\"class\":\"FekoCrawlWorker\",\"args\":[],\"retry\":false,\"queue\":\"default\",\"backtrace\":true,\"jid\":\"7d134b4ee9407973d7803f0b\",\"created_at\":1593459806.6006012,\"enqueued_at\":1593459806.6006787}"}
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: Selenium::WebDriver::Error::WebDriverError: not a file: "./bin/chromedriver"
Jun 29 19:43:26 aquacraft sidekiq[7201]: 2020-06-29T19:43:26.608Z 7201 TID-ou13yz8xx WARN: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:136:in `assert_file'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:140:in `assert_executable'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:138:in `binary_path'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:94:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/service.rb:41:in `chrome'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:299:in `service_url'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/chrome/driver.rb:40:in `initialize'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `new'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/driver.rb:46:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver.rb:88:in `for'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/capybara-2.18.0/lib/capybara/selenium/driver.rb:23:in `browser'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:32:in `port'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/selenium/driver.rb:28:in `pid'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/driver/base.rb:16:in `current_memory'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:51:in `ensure in visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:52:in `visit'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:128:in `block in crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `each'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/shared/bundle/ruby/2.6.0/gems/kimurai-1.4.0/lib/kimurai/base.rb:124:in `crawl!'
Jun 29 19:43:26 aquacraft sidekiq[7201]: /home/deploy/aquacraft/releases/20200627190630/app/workers/feko_crawl_worker.rb:9:in `perform'

Sidekiq worker code:

require 'sidekiq-scheduler'

class FekoCrawlWorker
  include Sidekiq::Worker

  sidekiq_options retry: false, backtrace: true, queue: 'default'

  def perform
    Crawlers::Feko.crawl!
  end
end
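One way to make this class of failure easier to diagnose (a hypothetical guard, not a Kimurai API) is to assert that the configured driver path is absolute and executable before starting the crawl. A relative path like ./bin/chromedriver resolves against the Sidekiq process's working directory, which usually differs from the Rails root, so the same config can work in a console and fail in a worker:

```ruby
require 'pathname'

# Hypothetical guard: raise early with a descriptive message instead of
# letting Selenium fail with an opaque "not a file" error. Relative paths
# are rejected outright, since they resolve against the current working
# directory of whichever process runs the spider.
def assert_chromedriver!(path)
  raise "chromedriver path must be absolute, got: #{path.inspect}" unless Pathname.new(path).absolute?
  raise "chromedriver is not an executable file: #{path.inspect}" unless File.file?(path) && File.executable?(path)
  path
end
```

Calling this on the same value you pass to config.chromedriver_path at the top of perform would turn the Selenium error into a message that names the misconfigured path.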

I have more or less the same issue when I use :selenium_chrome, but on my local machine:

/Users/kaka/.asdf/installs/ruby/2.5.1/lib/ruby/gems/2.5.0/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/platform.rb:136:in `assert_file': not a file: "/usr/local/bin/chromedriver" (Selenium::WebDriver::Error::WebDriverError)

It works when I use :selenium_firefox

Also check the config; I haven't tried it, but maybe changing the default location for the webdriver could help: https://github.com/vifreefly/kimuraframework#configuration-options

 # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
  # config.chromedriver_path = "~/.local/bin/chromedriver"

Thanks @kaka-ruto, I tried using Kimurai.configure and it worked as shown below


require 'kimurai'
require 'csv'   # needed for CSV.open below
require 'json'  # needed for JSON.pretty_generate below

Kimurai.configure do |config|
  # Default logger has colored mode in development.
  # If you would like to disable it, set `colorize_logger` to false.
  # config.colorize_logger = false

  # Logger level for default logger:
  # config.log_level = :info

  # Custom logger:
  # config.logger = Logger.new(STDOUT)

  # Custom time zone (for logs):
  # config.time_zone = "UTC"
  # config.time_zone = "Europe/Moscow"

  # Provide custom chrome binary path (default is any available chrome/chromium in the PATH):
  # config.selenium_chrome_path = "/usr/bin/chromium-browser"
  # Provide custom selenium chromedriver path (default is "/usr/local/bin/chromedriver"):
  config.chromedriver_path = "/usr/bin/chromedriver"
end

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
      title = char_element.css('h2 a')[0].attributes["title"].value.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2 a')[0].attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.summary').text.gsub(/\n/, "")
      company = char_element.css('span.company').text.gsub(/\n/, "")
      location = char_element.css('div.location').text.gsub(/\n/, "")
      salary = char_element.css('div.salarySnippet').text.gsub(/\n/, "")
      requirements = char_element.css('div.jobCardReqContainer').text.gsub(/\n/, "")
      # job = [title, link, description, company, location, salary, requirements]
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary, requirements: requirements}

      @@jobs << job if !@@jobs.include?(job)
    end
  end

  def parse(response, url:, data: {})

    10.times do
      scrape_page

      # css returns a NodeSet, which is always truthy; check .any? instead
      if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
        browser.refresh
      end

      browser.find(:xpath, '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
      puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
      puts "🔺 🔺 🔺 🔺 🔺  CLICKED NEXT BUTTON 🔺 🔺 🔺 🔺"
    end

    CSV.open('jobs.csv', "w") do |csv|
      # write one row per job, not the whole array as a single row
      @@jobs.each { |job| csv << job.values }
    end

    File.open("jobs.json","w") do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end

    @@jobs
  end
end

jobs = JobScraper.crawl!
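Hard-coding /usr/bin/chromedriver ties the script to one distro's layout. A more portable sketch (find_in_path is a hypothetical helper, not part of Kimurai) is to resolve the driver from the PATH, mirroring what a shell's which chromedriver does:

```ruby
# Sketch: return the first executable named `binary` found on the given
# search path, or nil if none exists. The result could feed
# config.chromedriver_path when the install location varies per machine.
def find_in_path(binary, path_env = ENV['PATH'])
  path_env.to_s.split(File::PATH_SEPARATOR)
          .map  { |dir| File.join(dir, binary) }
          .find { |candidate| File.file?(candidate) && File.executable?(candidate) }
end
```

In the initializer that could look like config.chromedriver_path = find_in_path('chromedriver') || '/usr/bin/chromedriver', falling back to the Arch default when nothing is on the PATH.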

FYI, I am using Arch Linux, where chromedriver is installed at /usr/bin/chromedriver by default. When I finally ran the code I found another issue related to lsof: the tool is not installed by default on Arch, so I had to install it from the AUR repositories

yay -S lsof

Now everything looks good :)

Awesome @GarnicaJR ! Glad you got it working.