agron2017 / arbk-scraper

A Ruby script that uses Watir to scrape business registration data from ARBK's website.

ARBK Scraper

Requirements

  • MongoDB: to persist the scraped data.
  • Ruby: to run the script.
  • ruby-dev: to install the Ruby MongoDB driver, whose native extensions need the Ruby development headers.
  • Make: to build Ruby gems that include native extensions.
  • zlib: required to build the watir-nokogiri gem; without it, installation fails with the error "zlib is missing; necessary for building libxml2".
  • ChromeDriver (WebDriver for Chrome): lets the watir gem drive the Chrome browser.
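
A quick way to confirm that the command-line requirements are reachable is to look for them on PATH before running the scraper. The following is only a sketch: the executable names in REQUIRED_TOOLS are assumptions (for example, mongod only applies when MongoDB runs on the same machine), and the helper names are hypothetical and not part of the project.

```ruby
# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = %w[ruby make chromedriver mongod].freeze

# Returns the full path of an executable found on PATH, or nil if absent.
def find_executable(name, path: ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Returns the subset of names that could not be found on PATH.
def missing_tools(names = REQUIRED_TOOLS)
  names.reject { |name| find_executable(name) }
end

missing = missing_tools
if missing.empty?
  puts 'All required tools were found on PATH.'
else
  puts "Not found on PATH: #{missing.join(', ')}"
end
```

This check does not cover zlib or ruby-dev, which are libraries and headers rather than executables.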

Ruby Dependencies

  • rubygems.
  • mongo-ruby-driver: the MongoDB driver for Ruby.
  • watir: an interface for scripting interactions with the Chrome browser.
  • nokogiri: an HTML, XML, SAX, and Reader parser that can search documents via XPath or CSS3 selectors.
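
To see which of these gems still need installing, RubyGems can be queried from Ruby itself. This is a minimal sketch, assuming the gem names mongo (the name under which mongo-ruby-driver is published), watir, and nokogiri; missing_gems is a hypothetical helper, not part of the project.

```ruby
# Gem names are assumptions based on the dependency list above.
REQUIRED_GEMS = %w[mongo watir nokogiri].freeze

# Returns the names of gems that are not installed for the current Ruby.
def missing_gems(names = REQUIRED_GEMS)
  names.reject { |name| Gem::Specification.find_all_by_name(name).any? }
end

missing = missing_gems
if missing.empty?
  puts 'All required gems are installed.'
else
  puts "Missing gems, try: gem install #{missing.join(' ')}"
end
```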

Possible Errors

The scraping process can fail with any of the following errors.

  1. no such window: target window already closed\nfrom unknown error: web view not found\n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
  2. unknown error: Element is not clickable at point (93, 334). Other element would receive the click: <li class="sf-megamenu-wrapper odd sf-item-1 sf-depth-1 sf-total-children-5 sf-parent-children-5 sf-single-children-0 menuparent">...
  3. \n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
  4. Net::ReadTimeout.
  5. undefined local variable or method `browser' for main:Object.
  6. unexpected alert open: {Alert text : [object Object]}\n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
  7. timed out after 30 seconds, waiting for #<Watir::TextField: located: false; {:id=>"MainContent_ctl00_txtNumriBiznesit", :tag_name=>"input"}> to be located.
  8. no such session\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
  9. Too many failed attempts to load search page: Net::ReadTimeout.
  10. timed out after 30 seconds, waiting for #<Watir::Anchor: located: false; {:xpath=>"//table[@class='views-table cols-4']/tbody//td/a", :tag_name=>"a"}> to be located.
  11. Too many failed attempts to load page via anchor click: timed out after 30 seconds, waiting for #<Watir::Anchor: located: false; {:xpath=>"//table[@class='views-table cols-4']/tbody//td/a", :tag_name=>"a"}> to be located.
  12. unknown error: Element <input name="ctl00$MainContent$ctl00$Submit1" type="submit" id="MainContent_ctl00_Submit1" value="Kërko"> is not clickable at point (93, 275). Other element would receive the click:
    ...
    \n (Session info: chrome=57.0.2987.110)\n (Driver info: chromedriver=2.28.455517 (2c6d2707d8ea850c862f04ac066724273981e88f),platform=Mac OS X 10.12.3 x86_64).
  13. browser window was closed.
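
Errors 9 and 11 suggest that page loads are retried a limited number of times before the scraper gives up. The project's actual retry logic is not shown here, so the following is only a sketch of that pattern; with_retries and its parameters are hypothetical names.

```ruby
# Hypothetical helper: run the block up to max_attempts times, then give up
# with a message shaped like errors 9 and 11 in the list above.
def with_retries(action, max_attempts: 3, wait: 0, rescue_from: [StandardError])
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *rescue_from => e
    if attempts < max_attempts
      sleep(wait) if wait.positive?
      retry
    end
    raise "Too many failed attempts to #{action}: #{e.message}"
  end
end

# Possible usage with Watir (browser and search_url are placeholders):
#   with_retries('load search page', rescue_from: [Net::ReadTimeout]) do
#     browser.goto(search_url)
#   end
```

Passing the error classes explicitly keeps unrelated failures, such as error 5 (a NameError), from being retried.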

You can count how many times each error type occurs with the following MongoDB shell query:

db.errors.aggregate([
  {$group : 
    { _id : '$errorMsg', count : {$sum : 1}}
  }
]).pretty()
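
The same grouping can be expressed from Ruby. The sketch below writes the pipeline as a Ruby array and adds an in-memory equivalent for checking expectations without a database; the collection name errors and the field errorMsg come from the query above, while count_errors, the host, and the database name arbk are assumptions.

```ruby
# The same $group stage as the shell query, written as a Ruby pipeline.
ERROR_COUNT_PIPELINE = [
  { '$group' => { '_id' => '$errorMsg', 'count' => { '$sum' => 1 } } }
].freeze

# In-memory equivalent of the pipeline: counts documents per errorMsg value.
def count_errors(docs)
  docs.each_with_object(Hash.new(0)) do |doc, counts|
    counts[doc['errorMsg']] += 1
  end
end

# With the mongo gem and a running server, the pipeline could be run like
# this (host and database name are placeholders, not taken from the project):
#   require 'mongo'
#   client = Mongo::Client.new(['127.0.0.1:27017'], database: 'arbk')
#   client[:errors].aggregate(ERROR_COUNT_PIPELINE).each { |row| puts row }
```

Because grouping is done on the full message, errors that embed coordinates or driver details are counted as separate types.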

License: GNU General Public License v3.0


Languages

Language: Ruby 100.0%