vifreefly / kimuraframework

Kimurai is a modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and allows you to scrape and interact with JavaScript-rendered websites

Include Docker support

Tails opened this issue · comments

commented

It's easy to get up and running using Docker (no need to install a bunch of dependencies on a system that you don't know about).

I got Docker working using the following files:

#Dockerfile
FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt-get install -q -y git unzip wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt-get install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

ENTRYPOINT ["kimurai"]

And its docker-compose.yml:

# 'extends' is not supported in version 3
version: '2'

services:

  base:
    build: ./
    entrypoint: /bin/bash
    working_dir: /app
    volumes:
      - ./:/app

  irb:
    extends: base
    entrypoint: irb
    volumes:
      - ./:/app

  kimurai:
    extends: base
    entrypoint: bundle exec kimurai
    volumes:
      - ./:/app

  crawl:
    extends: kimurai
    command: crawl
    volumes:
      - ./:/app

@Tails, would you be interested to make a PR for this?

commented

I will sometime this week.

How do you use this?

commented

IMHO a Docker image would be enough.

commented

Works for me (development build):
Dockerfile

FROM ruby:2.5.3-stretch
RUN gem install kimurai
RUN apt-get update && apt-get install -q -y git unzip lsof wget tar openssl xvfb chromium \
                                        firefox-esr libsqlite3-dev sqlite3 mysql-client default-libmysqlclient-dev

RUN cd /tmp && \
    wget https://chromedriver.storage.googleapis.com/2.39/chromedriver_linux64.zip && \
    unzip chromedriver_linux64.zip -d /usr/local/bin && \
    rm -f chromedriver_linux64.zip

RUN cd /tmp && \
    wget https://github.com/mozilla/geckodriver/releases/download/v0.21.0/geckodriver-v0.21.0-linux64.tar.gz && \
    tar -xvzf geckodriver-v0.21.0-linux64.tar.gz -C /usr/local/bin && \
    rm -f geckodriver-v0.21.0-linux64.tar.gz

RUN apt-get install -q -y chrpath libxft-dev libfreetype6 libfreetype6-dev libfontconfig1 libfontconfig1-dev && \
    cd /tmp && \
    wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    tar -xvjf phantomjs-2.1.1-linux-x86_64.tar.bz2 && \
    mv phantomjs-2.1.1-linux-x86_64 /usr/local/lib && \
    ln -s /usr/local/lib/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin && \
    rm -f phantomjs-2.1.1-linux-x86_64.tar.bz2

RUN mkdir -p /app

ADD Gemfile /app

RUN cd /app && bundle install

Gemfile

source 'https://rubygems.org' do
  gem 'kimurai'
  gem 'byebug'
end

Build

docker build . -t simple-kimurai 

Run (this opens a container with the environment installed for development, with the current directory mounted)

docker run --rm -it -v ${PWD}:/app -w /app simple-kimurai bash

commented

It would be great if the owner created an official Docker image.

@seliverstov-maxim The Dockerfile is great, but it crashes when running with multiple threads

I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360]  INFO -- MySpider: Info: visits: requests: 7, responses: 6
D, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360] DEBUG -- MySpider: Browser: driver.current_memory: 3837
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296299360]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
#<Thread:0x0000560bc78df6c0@/usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:299 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	19: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'
	18: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
	17: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
	16: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `request_to'
	15: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `public_send'
	14: from a.rb:33:in `try_parse'
	13: from a.rb:52:in `parse_question_page'
	12: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
	11: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
	10: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
	 9: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
	 8: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:52:in `get'
	 7: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:587:in `execute'
	 6: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 5: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 4: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 3: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 2: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 1: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=73.0.3683.75)
  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)
I, [2021-05-07 08:17:08 +0000#1693] [M: 47304283293120]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
F, [2021-05-07 08:17:08 +0000#1693] [M: 47304283293120] FATAL -- MySpider: Spider: stopped: {:spider_name=>"MySpider", :status=>:failed, :error=>"#<Selenium::WebDriver::Error::UnknownError: unknown error: session deleted because of page crash\nfrom unknown error: cannot determine loading status\nfrom tab crashed\n  (Session info: headless chrome=73.0.3683.75)\n  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)>", :environment=>"development", :start_time=>2021-05-07 08:16:42 +0000, :stop_time=>2021-05-07 08:17:08 +0000, :running_time=>"25s", :visits=>{:requests=>7, :responses=>6}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296275900]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296321600]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
I, [2021-05-07 08:17:08 +0000#1693] [C: 47304296845720]  INFO -- MySpider: Browser: driver selenium_chrome has been destroyed
Traceback (most recent call last):
	19: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'
	18: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
	17: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
	16: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `request_to'
	15: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:204:in `public_send'
	14: from a.rb:33:in `try_parse'
	13: from a.rb:52:in `parse_question_page'
	12: from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
	11: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
	10: from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
	 9: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
	 8: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:52:in `get'
	 7: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/oss/bridge.rb:587:in `execute'
	 6: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
	 5: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
	 4: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
	 3: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
	 2: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
	 1: from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=73.0.3683.75)
  (Driver info: chromedriver=2.39.562737 (dba483cee6a5f15e2e2d73df16968ab10b38a2bf),platform=Linux 5.10.25-linuxkit x86_64)

I'm having the same issues with multithreading inside of a docker container. Code works great on my Mac OS X box.

::WebDriver::Error::UnknownError: unknown error: session deleted because of page crash\nfrom unknown error: cannot determine loading status\nfrom tab crashed\n  (Session info: headless chrome=86.0.4240.111)>", :environment=>"development", :start_time=>2021-07-25 18:06:00.6242447 +0000, :stop_time=>2021-07-25 18:06:18.1101284 +0000, :running_time=>"17s", :visits=>{:requests=>2, :responses=>1}, :items=>{:sent=>0, :processed=>0}, :events=>{:requests_errors=>{}, :drop_items_errors=>{}, :custom=>{}}}
/usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:72:in `assert_ok': unknown error: session deleted because of page crash (Selenium::WebDriver::Error::UnknownError)
from unknown error: cannot determine loading status
from tab crashed
  (Session info: headless chrome=86.0.4240.111)
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/response.rb:34:in `initialize'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `new'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:88:in `create_response'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/default.rb:114:in `request'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/http/common.rb:64:in `call'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/bridge.rb:167:in `execute'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/w3c/bridge.rb:567:in `execute'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/remote/w3c/bridge.rb:59:in `get'
        from /usr/local/bundle/gems/selenium-webdriver-3.142.7/lib/selenium/webdriver/common/navigation.rb:32:in `to'
        from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/selenium/driver.rb:104:in `visit'
        from /usr/local/bundle/gems/capybara-3.35.3/lib/capybara/session.rb:278:in `visit'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/capybara_ext/session.rb:21:in `visit'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:201:in `request_to'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:313:in `block (3 levels) in in_parallel'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `each'
        from /usr/local/bundle/gems/kimurai-1.4.0/lib/kimurai/base.rb:305:in `block (2 levels) in in_parallel'

@thanhtoan1196 did you figure out a workaround?

@hjhart @thanhtoan1196 In my case I can't modify certain settings of my Docker container, so I added the Chrome flag --disable-dev-shm-usage and everything worked like a charm. The downside is that Chrome now uses the /tmp folder instead of /dev/shm, so your spider will probably be slower.

Problem is described here: https://stackoverflow.com/questions/53902507/unknown-error-session-deleted-because-of-page-crash-from-unknown-error-cannot
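
For anyone wondering how that flag might be wired in, here is a rough sketch, not a confirmed Kimurai setting. The thread does not show how Kimurai exposes Chrome arguments, so this goes through selenium-webdriver's Chrome options directly. The helper names chrome_arguments and build_chrome_options are hypothetical, and checking for /.dockerenv is only a common convention for detecting a Docker container, not a guarantee.

```ruby
# Hypothetical helpers (not part of Kimurai): decide which Chrome switches
# to use and add --disable-dev-shm-usage when running inside Docker.

# Assumption: Docker creates this marker file in the container's root.
DOCKER_MARKER = "/.dockerenv"

# Returns a new array of Chrome switches. The base array is not mutated.
def chrome_arguments(base: ["--headless"], in_container: File.exist?(DOCKER_MARKER))
  args = base.dup
  flag = "--disable-dev-shm-usage"
  args << flag if in_container && !args.include?(flag)
  args
end

# Applies the switches to a Selenium Chrome options object.
# Needs the selenium-webdriver gem, so it is only required when called.
def build_chrome_options(**kwargs)
  require "selenium-webdriver"
  options = Selenium::WebDriver::Chrome::Options.new
  chrome_arguments(**kwargs).each { |arg| options.add_argument(arg) }
  options
end

puts chrome_arguments(in_container: true).join(" ")
```

If you control how the container is started, giving it a larger /dev/shm (for example with docker run's --shm-size option) is the alternative discussed in the linked Stack Overflow question, and it avoids the slowdown from using /tmp.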

I have put together an updated version for the docker configuration.

https://github.com/iwoogy/kimurai-docker-example

Hope it helps.