camratchford / generic-scraper

Generic Scraper is a basic, low-code web scraper driven by YAML/JSON config files.

Generic Scraper


A simple metrics client enabling fetching of metrics via HTTP requests.
Limitless utility through Python user scripts.


This project is currently in early development and is highly (possibly completely) unstable.

Use at your own risk. Or preferably, wait until it's "done".

About

Generic Scraper is a basic, low-code web scraper driven by YAML/JSON config files.

Getting Started

Clone the repo

cd /opt/
git clone https://github.com/camratchford/generic_scraper.git

Set up your venv

cd ./generic_scraper
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install .

Create a YAML config file, replacing the values with your own

# Will scrape job postings from a lever.co job board
scraper_configs:
  level:
    container_el: div
    container_attr: class
    container_val: postings-wrapper
    list_item_el: a
    list_item_attr: "class"
    list_item_val:
      - posting-title
    href: True
    extract_items:
      - name: title
        tag: h5
        attr: data-qa
        val: posting-name
      - name: link
        tag: a
        attr: href
        val:
      - name: location
        tag: span
        attr: class
        val: sort-by-location
      - name: team
        tag: span
        attr: class
        val: sort-by-team

scraper_urls:
  - url: https://jobs.lever.co/imperfectfoods
    pagination: False
    config: level

  - url: https://jobs.lever.co/cfsenergy
    pagination: False
    config: level
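
Each entry under scraper_urls refers to a named entry under scraper_configs through its config key. As a rough sketch of that relationship, the snippet below renders part of the config above as JSON and checks that every URL points at a defined config. The find_undefined_configs helper is hypothetical and not part of generic_scraper, and whether config_from_file accepts this exact JSON shape is an assumption based only on the "yaml/json" description.

```python
import json

# JSON rendering of part of the YAML above; keys and values are copied from it.
# The empty YAML value (`val:`) becomes null.
CONFIG_JSON = """
{
  "scraper_configs": {
    "level": {
      "container_el": "div",
      "container_attr": "class",
      "container_val": "postings-wrapper",
      "list_item_el": "a",
      "list_item_attr": "class",
      "list_item_val": ["posting-title"],
      "href": true,
      "extract_items": [
        {"name": "title", "tag": "h5", "attr": "data-qa", "val": "posting-name"},
        {"name": "link", "tag": "a", "attr": "href", "val": null}
      ]
    }
  },
  "scraper_urls": [
    {"url": "https://jobs.lever.co/imperfectfoods", "pagination": false, "config": "level"},
    {"url": "https://jobs.lever.co/cfsenergy", "pagination": false, "config": "level"}
  ]
}
"""


def find_undefined_configs(config):
    """Hypothetical helper (not part of generic_scraper).

    Returns the URLs whose `config` key does not name an entry
    under `scraper_configs`.
    """
    defined = set(config.get("scraper_configs", {}))
    return [
        entry["url"]
        for entry in config.get("scraper_urls", [])
        if entry.get("config") not in defined
    ]


config = json.loads(CONFIG_JSON)
undefined = find_undefined_configs(config)
if undefined:
    print("URLs referencing an undefined config:", undefined)
```

Running a check like this before scraping would surface a mistyped config name early, rather than partway through a run.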

Using the generic_scraper module

from generic_scraper.scraper import Scraper
from generic_scraper.extractor import Extractor
from generic_scraper.config import scraper_config

def main():
    scraper_config.config_from_file(r"C:\Users\cameron\PyCharm Projects\level_scraper\tests\test.yml")
    scraper = Scraper(scraper_config)
    scraper.scrape()
    extractor = Extractor(scraper_config)
    extractor.extract()
    data = extractor.serialize()
    
    print(data)


if __name__ == "__main__":
    main()
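
The path passed to config_from_file in the example above is specific to one machine. One possible way to avoid hard-coding it, sketched here with the standard library only, is to take the path from the command line and fall back to a file in the current directory. The resolve_config_path helper and the test.yml default are illustrative assumptions, not part of generic_scraper.

```python
import sys
from pathlib import Path


def resolve_config_path(argv, default_name="test.yml"):
    """Hypothetical helper (not part of generic_scraper).

    Uses the first command-line argument as the config path if one is
    given; otherwise falls back to `default_name` in the current
    working directory.
    """
    if len(argv) > 1:
        return Path(argv[1]).expanduser().resolve()
    return Path.cwd() / default_name


if __name__ == "__main__":
    config_path = resolve_config_path(sys.argv)
    # In the example above, this string would replace the hard-coded path:
    # scraper_config.config_from_file(str(config_path))
    print(config_path)
```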

Running the tests

There are no tests

Coding Style

PEP-8

Usage

Everything is subject to change at any moment. Don't use it.

Deployment

As yet untested in production. Use at your own risk.

Built Using

  • FastAPI - ASGI based asynchronous Python web framework
  • Uvicorn - ASGI Web Server

Authors

See also the list of contributors who participated in this project
