NeilGraham / urlscrub

Scrape a webpage into JSON and RDF given a URL. It can be extended by adding support for more domains in `./domains`.

URL Scrub

A tool for parsing the webpage at a given URL into JSON and RDF.

Setup

Dependencies

Installation Process

  1. Install urlscrub with pip

    python3.10 -m pip install urlscrub
  2. Install geckodriver

    • Download and install Firefox.

      • Linux (Ubuntu):

        sudo apt-get install firefox
    • Download geckodriver.zip.

    • Unzip the geckodriver file (geckodriver.exe on Windows) into a directory of your choice.

    • Append the directory containing geckodriver to your PATH variable. (Guide)

  3. Install chromedriver

    • Download and install Google Chrome.

    • Find the version of Google Chrome you have installed.

      • Open Google Chrome web browser.

      • Click the three vertical dots at the top right. (Picture)

      • At the bottom of the dropdown, select Help, then About Google Chrome. (Picture)

      • Note the version number displayed. (Picture; e.g. 102.0.5005.115)

    • Download the chromedriver.zip whose version number most closely matches your installed Chrome version.

      • An exact match is not required (e.g. chromedriver 102.0.5005.61 works with Google Chrome 102.0.5005.115).
    • Unzip the chromedriver file (chromedriver.exe on Windows) into a directory of your choice.

    • Append the directory containing chromedriver to your PATH variable. (Guide)
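
Once both drivers are installed, a quick check can confirm that they are discoverable on PATH and that a driver version is close to the browser version. The snippet below is a minimal sketch and not part of urlscrub: the helper names `find_drivers` and `same_build` are hypothetical, and it assumes that agreement on the first three version components is close enough, as in the 102.0.5005.61 / 102.0.5005.115 example above.

```python
import shutil

# Executable names taken from the installation steps above. On Windows,
# shutil.which also tries the extensions listed in PATHEXT (such as .exe).
DRIVERS = ("geckodriver", "chromedriver")


def find_drivers(names=DRIVERS):
    """Map each driver name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def same_build(browser_version, driver_version):
    """Compare the first three dot-separated components of two version strings.

    Assumption: matching on major.minor.build is sufficient; the final
    component is allowed to differ.
    """
    return browser_version.split(".")[:3] == driver_version.split(".")[:3]


if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
    print(same_build("102.0.5005.115", "102.0.5005.61"))
```

If a driver is reported as not found, the directory added to PATH in the steps above is the first thing to recheck (a new terminal session may be needed for the change to take effect).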

Command Line Usage

  • Command:

    urlscrub --skip-cookies --driver "chrome" -l "https://www.amazon.com/All-new-Kindle-Oasis-now-with-adjustable-warm-light/dp/B07GRSK3HC"
  • Response:

    {
      "results": [
        {
          "type": "product",
          "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
          "availability": "In Stock.",
          "rating": "19,734 ratings",
          "imageURL": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SX679_.jpg"
        }
      ]
    }
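
The same command can also be driven from Python. The following is a rough sketch under stated assumptions, not an API provided by urlscrub: it assumes the JSON response is written to standard output, and the helpers `build_command`, `scrape`, and `rating_count` are hypothetical names introduced here for illustration. Only the flags shown in the command above are used.

```python
import json
import subprocess


def build_command(url, driver="chrome"):
    """Build the argument list for the command shown above."""
    return ["urlscrub", "--skip-cookies", "--driver", driver, "-l", url]


def scrape(url, driver="chrome"):
    """Run urlscrub and decode its response.

    Assumption: the JSON response is printed to standard output.
    """
    completed = subprocess.run(
        build_command(url, driver), capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)


def rating_count(rating):
    """Turn a string such as "19,734 ratings" into the integer 19734."""
    return int(rating.split()[0].replace(",", ""))


# The response shown above, trimmed to three fields for brevity.
SAMPLE = r"""
{
  "results": [
    {
      "type": "product",
      "productTitle": "Kindle Oasis \u2013 With adjustable warm light",
      "rating": "19,734 ratings"
    }
  ]
}
"""

if __name__ == "__main__":
    for item in json.loads(SAMPLE)["results"]:
        print(item["productTitle"], rating_count(item["rating"]))
```

Passing the arguments as a list avoids shell quoting issues with URLs. `scrape` is not called here because it needs urlscrub and a browser driver installed; the sample string stands in for its return value.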

Guides


Languages

Python 97.0%, Shell 3.0%