Dminor7 / ira

A generalized scraper for web scraping in a declarative manner.

Introduction

A generalized web scraper, currently in development.

Getting Started

  1. Installation process
    • Install pipenv.
    • From the root folder, run pipenv install.
    • After the dependencies are installed, run pipenv shell to activate the virtual environment.

How to build config.json

See the examples given in jobs/ to build a custom config.json file.

  • The elements to scrape can be selected with the following selection types:
    • css : CSS selectors. Example -
    [{
        "name":"likes",
        "selection":"css",
        "search":["span#likes", "p#likes"], // Multiple selectors can be specified; the one that matches is used.
        "first":true, // Return only the first occurrence encountered
        "attribute":"text" // Attribute to get the value from
    }]
    /*
        OUTPUT:
        [{
            "likes":"37"
        }]
    */
    • xpath : The XPath of an element. Example -
    [{
        "name":"keywords",
        "selection":"xpath",
        "search":["//span[@id='keywords']", "//p[@id='keywords']"], // Multiple selectors can be specified; the one that matches is used.
        "first":false, // Return every occurrence, not only the first
        "attribute":"text" // Attribute to get the value from
    }]
    /*
        OUTPUT:
        [{
            "keywords":["Battle Ropes",
                        "Kettlebells",
                        "BOSU",
                        "Dumbbells",
                        "Jump Ropes",
                        "Medicine Balls",
                        "Plyometric Boxes",
                        "Resistance Bands"]
        }]
    */
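The `search` list and the `first` flag in the css and xpath examples above can be read as "try each selector in order, keep the results of the one that matches". The following is a minimal sketch of that reading, not ira's actual implementation. It uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet; `apply_rule` is a hypothetical helper, and because ElementTree supports only a subset of XPath, a leading "." is added to each expression.

```python
import xml.etree.ElementTree as ET


def apply_rule(root, rule):
    """Try each XPath in rule["search"]; use the first one that matches."""
    for xpath in rule["search"]:
        # ElementTree only accepts relative paths, so "//x" becomes ".//x".
        path = ("." + xpath) if xpath.startswith("/") else xpath
        nodes = root.findall(path)
        if nodes:
            texts = [(node.text or "").strip() for node in nodes]
            # first=True -> a single value, first=False -> every match.
            return {rule["name"]: texts[0] if rule.get("first") else texts}
    return {rule["name"]: None}


# Hypothetical page fragment used only for this illustration.
page = ET.fromstring(
    "<html><body>"
    "<p id='keywords'>Battle Ropes</p>"
    "<p id='keywords'>Kettlebells</p>"
    "</body></html>"
)
rule = {
    "name": "keywords",
    "selection": "xpath",
    "search": ["//span[@id='keywords']", "//p[@id='keywords']"],
    "first": False,
    "attribute": "text",
}
result = apply_rule(page, rule)
first_only = apply_rule(page, {**rule, "first": True})
```

Here the `span` selector matches nothing, so the `p` selector supplies the values; with `"first": true` only the first text is kept.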
    • regex : Extracts the text captured by a regular expression pattern. Example -
    [
        {
            "name":"data",
            "selection":"regex",
            "search":["<script>window.__PRELOADED_STATE__ = (.*);</script>"]
        }
    ]
    /*
        OUTPUT:
        [{
            "data":"{\"address\":\"23 avenue fake street\", \"phoneNumber\":\"+1 (000-000-0000)\"}"
        }]
    
        The result can later be converted into a Python dictionary with json.loads (or with eval).
    */
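As an illustration of the regex selection above, the sketch below applies the same pattern with Python's `re` module and parses the captured text with `json.loads`. The HTML string is a made-up stand-in for a real page, and this shows only the general technique, not how ira runs the rule internally.

```python
import json
import re

# Pattern taken from the config example above.
pattern = r"<script>window.__PRELOADED_STATE__ = (.*);</script>"

# Hypothetical page source used only for this illustration.
html = (
    "<script>window.__PRELOADED_STATE__ = "
    '{"address":"23 avenue fake street", "phoneNumber":"+1 (000-000-0000)"};'
    "</script>"
)

match = re.search(pattern, html)
raw = match.group(1) if match else None  # the captured JSON text
state = json.loads(raw) if raw else {}   # parsed into a Python dict
```

`json.loads` is preferable to `eval` for this step because `eval` executes whatever expression the page happens to contain.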
    • find : Extracts the text that fills the {} placeholder in a search template. Example -
    [
        {
            "name":"phone",
            "selection":"find",
            "search":["\"phoneNumber\":\"{}\""],
            "first":true,
            "attribute":"text"
        }
    ]
    /*
        OUTPUT:
        [{
            "phone":"+1 (000-000-0000)"
        }]
    */
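One possible reading of the "find" example above is that `{}` marks the part of the template to capture. The sketch below implements that reading with the standard `re` module; `find_template` is a hypothetical helper written for this illustration, and ira may resolve the placeholder differently.

```python
import re


def find_template(template, text):
    """Return the text captured by the first "{}" in template, or None."""
    # Escape the literal parts, then turn the escaped "{}" into a lazy group.
    regex = re.escape(template).replace(re.escape("{}"), "(.*?)")
    match = re.search(regex, text)
    return match.group(1) if match else None


# Text as produced by the regex example (illustrative only).
blob = '{"address":"23 avenue fake street", "phoneNumber":"+1 (000-000-0000)"}'
phone = find_template('"phoneNumber":"{}"', blob)
```

The lazy group stops at the first closing quote after the placeholder, so only the phone number itself is returned.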
    • tables : Extracts all tables from the given HTML. Example -
    [{
        "name":"info_tables",
        "selection":"tables"
    }]
    
    /*
    Ex URL : https://gympricelist.com/title-boxing-club-prices/
    
    Output:
        [
            {
                "info_tables": [
            [
                {
                    "Service": "MONTHLY",
                    "Cost": "MONTHLY"
                },
                {
                    "Service": "SINGLE",
                    "Cost": "SINGLE"
                },
                {
                    "Service": "Initiation Fee",
                    "Cost": "$149.49"
                },
                {
                    "Service": "Monthly Fee",
                    "Cost": "$79.49"
                },
                {
                    "Service": "Cancellation Fee",
                    "Cost": "$0.00"
                },
                {
                    "Service": "TWO ADULTS  (adsbygoogle = window.adsbygoogle || []).push({});",
                    "Cost": "TWO ADULTS  (adsbygoogle = window.adsbygoogle || []).push({});"
                },
                {
                    "Service": "Initiation Fee",
                    "Cost": "$299.49"
                },
                {
                    "Service": "Monthly Fee",
                    "Cost": "$149.49"
                },
                {
                    "Service": "Cancellation Fee",
                    "Cost": "$0.00"
                },
                {
                    "Service": "Yearly",
                    "Cost": "Yearly"
                },
                {
                    "Service": "SINGLE",
                    "Cost": "SINGLE"
                },
                {
                    "Service": "Initiation Fee",
                    "Cost": "$99.49"
                },
                {
                    "Service": "Annual Fee",
                    "Cost": "$719.49"
                },
                {
                    "Service": "Cancellation Fee",
                    "Cost": "$0.00"
                },
                {
                    "Service": "TWO ADULTS",
                    "Cost": "TWO ADULTS"
                },
                {
                    "Service": "Initiation Fee",
                    "Cost": "$199.49"
                },
                {
                    "Service": "Annual Fee",
                    "Cost": "$1439.49"
                },
                {
                    "Service": "Cancellation Fee",
                    "Cost": "$0.00"
                }
            ],
            [
                {
                    "0": "Days",
                    "1": "Hours"
                },
                {
                    "0": "Monday",
                    "1": "8AM–5PM"
                },
                {
                    "0": "Tuesday",
                    "1": "8AM–5PM"
                },
                {
                    "0": "Wednesday",
                    "1": "8AM–5PM"
                },
                {
                    "0": "Thursday",
                    "1": "8AM–5PM"
                },
                {
                    "0": "Friday",
                    "1": "8AM–5PM"
                },
                {
                    "0": "Saturday",
                    "1": "Closed"
                },
                {
                    "0": "Sunday",
                    "1": "Closed"
                }
            ]
        ]
            }
        ]
    */
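The output above suggests that rows are keyed by the header text when a table has a header row, and by column index ("0", "1", ...) when it does not. The sketch below reproduces that shape with the standard library's `html.parser`. `TableParser`, `rows_to_dicts` and `extract_tables` are hypothetical names, nested tables are not handled, and ira's own extraction may work differently.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows; a row is [is_header, cells]."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._rows = None  # rows of the table currently open
        self._row = None   # [is_header, cells] of the row currently open
        self._cell = None  # text fragments of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = [False, []]
        elif tag in ("td", "th") and self._row is not None:
            self._row[0] = self._row[0] or tag == "th"
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row[1].append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def rows_to_dicts(rows):
    """Key cells by header text if the first row is a header, else by index."""
    if rows and rows[0][0]:
        header = rows[0][1]
        return [dict(zip(header, cells)) for _, cells in rows[1:]]
    return [{str(i): c for i, c in enumerate(cells)} for _, cells in rows]


def extract_tables(html):
    parser = TableParser()
    parser.feed(html)
    return [rows_to_dicts(rows) for rows in parser.tables]


# Hypothetical HTML used only for this illustration.
sample = (
    "<table><tr><th>Service</th><th>Cost</th></tr>"
    "<tr><td>Monthly Fee</td><td>$79.49</td></tr></table>"
    "<table><tr><td>Days</td><td>Hours</td></tr>"
    "<tr><td>Monday</td><td>8AM-5PM</td></tr></table>"
)
tables = extract_tables(sample)
```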
    • recursive : Iterates over a nested HTML structure recursively. Example -
    [{
        
        "name": "amenities",
        "selection": "recursive",
        "rules": {
            "data(#amenities > div > div)": [
                {
                    "name": "h2",
                    "services(ul)": [
                        "li"
                    ]
                }
            ]
        }
    
    }]
    
    /*  
        Ex URL:https://www.anytimefitness.com/gyms/2863/roseville-ca-95661/
        
        Output: 
        [{
            "amenities": {
            "data": [
                {
                    "name": "Gym Amenities",
                    "services": [
                        "24-Hour Access",
                        "24-Hour Security",
                        "Convenient Parking",
                        "Worldwide Club Access",
                        "Private Restrooms",
                        "Private Showers",
                        "Tanning",
                        "HDTVs",
                        "Health Plan Discounts",
                        "Wellness Programs",
                        "Free Classes"
                    ]
                },
                {
                    "name": "Cardio",
                    "services": [
                        "Treadmills",
                        "Elliptical Cross-trainers",
                        "Spin Bikes",
                        "Cardio TVs",
                        "Exercise Cycles",
                        "Rowing Machines",
                        "Stair Climbers"
                    ]
                },
                {
                    "name": "Strength/Free Weights",
                    "services": [
                        "Free Weights",
                        "Squat Racks",
                        "Plate Loaded",
                        "Circuit/Selectorized",
                        "Dumbbells",
                        "Barbells"
                    ]
                },
                {
                    "name": "Functional Training",
                    "services": [
                        "Battle Ropes",
                        "Kettlebells",
                        "TRX",
                        "BOSU",
                        "Dumbbells",
                        "Jump Ropes",
                        "Medicine Balls",
                        "Plyometric Boxes",
                        "Resistance Bands"
                    ]
                },
                {
                    "name": "Training and Coaching Services",
                    "services": [
                        "Personal Training",
                        "Specialized Classes",
                        "Small Group Training",
                        "Virtual Studio Classes",
                        "Fitness Assessment"
                    ]
                }
            ]
        }
        }]
    
    */
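In the recursive example above, each key of the form name(selector) appears to pair an output field name with the CSS selector to descend into, while plain keys such as "name" map directly to a selector. This is inferred from the example and its output, not from ira's source. A small sketch of how such keys could be split, with `split_rule_key` as a hypothetical helper:

```python
import re

# "field(selector)": everything before the first "(" is the field name.
KEY_PATTERN = re.compile(r"^(?P<name>[^()]+)\((?P<selector>.+)\)$")


def split_rule_key(key):
    """Split "name(selector)" into (name, selector); plain keys get None."""
    match = KEY_PATTERN.match(key)
    if match:
        return match.group("name").strip(), match.group("selector").strip()
    return key, None


data_key = split_rule_key("data(#amenities > div > div)")
services_key = split_rule_key("services(ul)")
plain_key = split_rule_key("name")
```

With keys split this way, a recursive walker could select the elements for each selector and apply the nested rules to every match, which would yield the nested "data" and "services" lists shown in the output.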

Methods on attributes:

  • href
    {
        "name":"website",
        "selection":"css",
        "search":["my selection"],
        "first":true,
        "attribute":"href",
        "extract_from_href":"?url" // Extract Query Parameter, here url
    }
  • text
    {
        "name":"total_reviews",
        "selection":"css",
        "search":["my selection"],
        "first":true,
        "attribute":"text",
        "extract_from_text":"-?\\d+\\.?\\d*" // Extract from text, here number
    }
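The two attribute methods above can be approximated with the standard library: `urllib.parse` for pulling a query parameter out of a link, and `re` for pulling a pattern out of text. The function names mirror the config keys but are hypothetical helpers written for this sketch, and the URL and text are made-up inputs.

```python
import re
from urllib.parse import parse_qs, urlparse


def extract_from_href(href, param):
    """Return the first value of a query parameter, e.g. "?url" -> "url"."""
    values = parse_qs(urlparse(href).query).get(param.lstrip("?"), [])
    return values[0] if values else None


def extract_from_text(text, pattern):
    """Return the first substring of text that matches pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


website = extract_from_href(
    "https://example.com/redirect?url=https%3A%2F%2Fgym.example.org&src=list",
    "?url",
)
total_reviews = extract_from_text("128 reviews", "-?\\d+\\.?\\d*")
```

`parse_qs` also percent-decodes the value, so the redirect target comes back as a plain URL.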

How to Run

Write a custom class in example.py (see the examples) that inherits from the Crawl class, then run ExampleClass.run().

Roadmap

  • Add functionality to render HTML (with proxy), simply by setting render=True.
  • Designing API.
  • Integrating Celery.
  • Dynamic Celery workflow for registered jobs, defined in a YAML file. Example -
example.MyWorkflow:
  tasks:
    - Google
    - GROUP_1:
        type: group
        tasks:
          - Yelp
          - BBB
          - Manta
