chuchro3 / WebCrawler

Web Crawler for Workday job postings

WebCrawler

OVERVIEW:
This web crawler scrapes job postings from a given Workday job postings URL. Each posting is saved to a file named by its job posting ID. The file contains JSON with the detailed description retrieved from the posting's sub-URL, along with a list of notable labels pulled from the original posting description, such as job title, location, and posted date.
Once all job URLs on the first page have been collected, the details of each posting can be retrieved from its branching URL in parallel (see options below).
The crawler has been tested successfully against 3 different Workday job posting websites and should extend readily to more.
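
The parallel step described above can be sketched roughly as follows. This is a minimal illustration, not the code in crawler.py: the function names `save_posting` and `crawl_postings`, the `fetch_details` callable, the `<job_id>.json` file name, and the `description`/`labels` keys are all assumptions made for the example.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_posting(dest_dir, job_id, details, labels):
    """Write one posting to <dest_dir>/<job_id>.json (assumed file layout)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    path = dest / f"{job_id}.json"
    # Key names are hypothetical; the real crawler may use different ones.
    record = {"description": details, "labels": labels}
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


def crawl_postings(job_urls, fetch_details, dest_dir, thread_count=4):
    """Fetch every posting in parallel and store it by job posting ID.

    job_urls: dict mapping job ID -> posting URL (hypothetical shape).
    fetch_details: callable(url) -> (details, labels), supplied by the caller.
    """
    def work(item):
        job_id, url = item
        details, labels = fetch_details(url)
        return save_posting(dest_dir, job_id, details, labels)

    # Each posting is independent, so a thread pool can fetch them concurrently.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        return list(pool.map(work, job_urls.items()))
```

Passing the fetch function in as a parameter keeps the sketch free of any particular HTTP or browser library, since the source does not say which one the crawler uses.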



CONTENTS:

  • crawler.py
    main logic for scraping Workday job postings; also the entry point of the program
  • util.py
    utility functions used by the crawler; they do not depend on the crawler itself



USAGE: python3 crawler.py [options]

Options:

  • -h, --help show this help message and exit

  • -u MAIN_LINK, --url=MAIN_LINK Job Posting URL [Default: https://mastercard.wd1.myworkdayjobs.com/CorporateCareers]

  • -d DEST_DIR, --dest=DEST_DIR Destination Directory [Default: ./test]

  • -t THREAD_COUNT, --threads=THREAD_COUNT Number of parallel threads [Default: 4]

  • -v, --verbose Verbose output to stdout [Default: False]
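
The option list above has the shape of help text produced by Python's `optparse` module. A minimal sketch of a parser that would accept the same flags is shown below; the `build_parser` name is hypothetical, and the `dest` names are inferred from the `MAIN_LINK`, `DEST_DIR`, and `THREAD_COUNT` placeholders rather than taken from crawler.py.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser for the documented options (sketch, not the real code)."""
    parser = OptionParser(usage="python3 crawler.py [options]")
    parser.add_option(
        "-u", "--url", dest="main_link",
        default="https://mastercard.wd1.myworkdayjobs.com/CorporateCareers",
        help="Job Posting URL [Default: %default]")
    parser.add_option(
        "-d", "--dest", dest="dest_dir", default="./test",
        help="Destination Directory [Default: %default]")
    parser.add_option(
        "-t", "--threads", dest="thread_count", type="int", default=4,
        help="Number of parallel threads [Default: %default]")
    parser.add_option(
        "-v", "--verbose", dest="verbose", action="store_true", default=False,
        help="Verbose output to stdout [Default: %default]")
    return parser


if __name__ == "__main__":
    options, _args = build_parser().parse_args()
    print(options.main_link, options.dest_dir,
          options.thread_count, options.verbose)
```

With these definitions, an invocation such as `python3 crawler.py -d ./out -t 8 -v` would override the destination directory and thread count while keeping the default URL.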
