pszemraj / scrape-viz-jobs

A tool for scraping and clustering job postings from ch.indeed.com. Visualization is done through various clustering and dimensionality reduction techniques.

Scrape & Viz Job Postings: Switzerland

Scrapes job postings from ch.indeed.com and visualizes them via various clustering and dimensionality reduction techniques. This makes it easier to find similar jobs by title or description.



Updates w.r.t. Original

  • The "switzerland" folder contains a link to an .ipynb notebook, which in turn links to Colab. The notebook merges the job_scraper.py code with Demo.ipynb from the original project and adapts it to the Swiss version of Indeed; the changes are mostly to URL syntax.
  • Currently, the CH version only scrapes data from Indeed.
  • A link to a Colab version is here. Save a copy to your own Drive to try it out.

Added Features

Because the original only pulled the data and saved it to an Excel file, additional features have been added to make the script more useful:

1 - Integration with Google Drive

  • Files are now auto-saved to the specified Google Drive folder and include the day's date to keep track of them.
  • Files have added columns for the date and time pulled, in case building a larger-scale database later is useful.
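
The date-stamping described above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the helpers `dated_filename` and `add_pull_timestamp`, the column names `date_pulled` and `time_pulled`, the filename pattern, and the example Drive path are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def dated_filename(folder, query, job_type, when=None):
    """Build an output path that embeds the pull date (hypothetical naming scheme)."""
    when = when or datetime.now()
    name = f"{query}_{job_type}_{when:%Y-%m-%d}.xlsx"
    return Path(folder) / name


def add_pull_timestamp(rows, when=None):
    """Return copies of the scraped rows with date and time columns added."""
    when = when or datetime.now()
    stamp = {
        "date_pulled": when.strftime("%Y-%m-%d"),  # assumed column name
        "time_pulled": when.strftime("%H:%M:%S"),  # assumed column name
    }
    return [{**row, **stamp} for row in rows]


if __name__ == "__main__":
    pulled_at = datetime.now()
    rows = [{"titles": "Data Intern", "companies": "ACME"}]
    # Example folder only; use whatever Drive folder is mounted in the notebook.
    print(dated_filename("/content/drive/MyDrive/jobs", "data", "internship", pulled_at))
    print(add_pull_timestamp(rows, pulled_at))
```

Stamping every row, rather than only the filename, means files from different days can later be concatenated without losing track of when each posting was seen.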

2 - Added Visualization

  • k-means visualization: the text in a field of choice (typically the job title or the summary) is vectorized and then clustered with unsupervised k-means.
  • Current vectorization options are TF-IDF or word2vec using the pretrained Google News model (available through Gensim).
  • The optimal number of k-means clusters is determined via the elbow method.
  • Jobs are then plotted by their dimensionality-reduced representation (currently PCA) and colored by cluster. A custom plotting function (roughly analogous to the one built into TextHero, but with more features) displays the job data.
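
The vectorize, cluster, and pick-k steps above can be sketched in plain Python. This is an illustration only and not the notebook's implementation, which would more likely rely on a library such as scikit-learn. The function names (`tfidf_vectors`, `kmeans`, `elbow_k`) are hypothetical, the elbow rule used here (largest second difference of the inertia curve) is one simple heuristic among several, and the word2vec option, the PCA projection, and the plotting are left out.

```python
import math
import random
from collections import Counter


def tfidf_vectors(texts):
    """Turn each text into a dense TF-IDF vector over the shared vocabulary."""
    docs = [text.lower().split() for text in texts]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Smoothed inverse document frequency: words in many documents get low weight.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)  # avoid dividing by zero on empty text
        vectors.append([counts[w] / length * idf[w] for w in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    centers = random.Random(seed).sample(points, k)

    def assign():
        return [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]

    for _ in range(n_iter):
        labels = assign()
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    labels = assign()
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def elbow_k(inertia_by_k):
    """Pick the k where the inertia curve bends most (largest second difference)."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = ks[0], float("-inf")
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        bend = ((inertia_by_k[prev] - inertia_by_k[cur])
                - (inertia_by_k[cur] - inertia_by_k[nxt]))
        if bend > best_bend:
            best_k, best_bend = cur, bend
    return best_k


if __name__ == "__main__":
    titles = ["data science intern", "data analyst intern", "machine learning intern",
              "software engineer", "backend software engineer", "marketing manager"]
    vectors = tfidf_vectors(titles)
    inertias = {k: kmeans(vectors, k)[1] for k in range(1, 6)}
    labels, _ = kmeans(vectors, elbow_k(inertias))
    for title, label in zip(titles, labels):
        print(label, title)
```

The elbow method looks for the point where adding another cluster stops reducing the within-cluster distance by much; with real job data the curve is often less clear-cut than in a toy example, so the chosen k is worth a visual check.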

viz_sans_labels

  • Text with the company name can be added to see distributions. (Note: the README has static images, but the graphs are interactive Plotly scatterplots in HTML with tooltips.)

viz_w_labels

3 - Google Colab Tables

  • Uses Google Colab's built-in table feature for dataframes, letting the user filter, sort, and view job data without leaving the notebook.

table_example

4 - Link Shortening

  • Integrates with the pyshorteners package to shorten scraped links (for use in the actual app).
  • Works with bit.ly.
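
As a rough sketch of how link shortening might be wired in (the helper names `shorten_links` and `make_bitly_shortener` are hypothetical and not taken from the notebook), the shortener can be passed in as a callable so that a failed request falls back to the original URL. The bit.ly client is created through `pyshorteners.Shortener(api_key=...)`, which needs a valid bit.ly access token and network access; the demo below uses a stand-in instead.

```python
def shorten_links(links, shorten):
    """Shorten each link with `shorten`, keeping the original URL on failure."""
    short = []
    for link in links:
        try:
            short.append(shorten(link))
        except Exception:
            # Rate limits or network errors should not lose the original link.
            short.append(link)
    return short


def make_bitly_shortener(api_key):
    """Return a callable backed by the pyshorteners bit.ly client."""
    import pyshorteners  # imported lazily so shorten_links has no hard dependency

    return pyshorteners.Shortener(api_key=api_key).bitly.short


if __name__ == "__main__":
    # Stand-in shortener so the sketch runs without an API key or network access.
    def fake_shorten(url):
        return "https://short.example/" + url.rsplit("/", 1)[-1]

    print(shorten_links(["https://example.com/job/1"], fake_shorten))
```

With a real token, `shorten_links(links, make_bitly_shortener(token))` would replace the stand-in.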

Example

In the section after all the function definitions (i.e. main), the following code returns 50 job postings for language = en, job type = internship, and job query = "data":

# define input params for query
desired_characs = ['titles', 'companies', 'links', 'date_listed', 'summary']
jq1 = "data"
jt1 = "internship"
lan = "en"

# scrape data
chdf1 = find_CHjobs_from(website="indeed", desired_characs=desired_characs,
                         job_query=jq1, job_type=jt1, language=lan)
# process output scraped data
q1_processed = indeed_postprocess(chdf1, query_term=jq1, query_jobtype=jt1,
                       shorten_links=False, download_excel=True)
# display Colab data table
data_table.DataTable(indeed_datatable(q1_processed),
                     include_index=False, num_rows_per_page=20)

# generate viz
viz1 = q1_processed.copy()
# "short_link" may be absent when shorten_links=False, so ignore missing columns
viz1.drop(columns=["links", "short_link"], inplace=True, errors="ignore")
viz_job_data_word2vec(viz1, "summary", save_plot=True, show_text=True,
                      query_name=jt1 + " in " + jq1)

Details on Querying

The following describes possible input params to find_CHjobs_from():

    - website: which website to search
        - (options: 'indeed' or 'indeed_default')
    - job_query: keywords used to narrow down the jobs
        - for example 'data'
    - job_type:
        - 'internship' or 'fulltime' or 'permanent'
    - language:
        - 'en' or 'de'; other language codes such as 'fr' may also work
    - desired_characs: which columns of data to extract. Options are:
        - 'titles', 'companies', 'links', 'date_listed', 'summary'
    - filename: default is "JS_test_results.xls"; can be changed to any name
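
A small pre-flight check built from the options listed above can catch typos before a scrape starts. This is a sketch under stated assumptions: `check_query_params` is a hypothetical helper, not part of the notebook, and `language` is deliberately left unchecked because the set of supported codes is not fully specified.

```python
ALLOWED_WEBSITES = {"indeed", "indeed_default"}
ALLOWED_JOB_TYPES = {"internship", "fulltime", "permanent"}
ALLOWED_CHARACS = ["titles", "companies", "links", "date_listed", "summary"]


def check_query_params(website, job_type, desired_characs):
    """Raise ValueError if an argument falls outside the documented options."""
    if website not in ALLOWED_WEBSITES:
        raise ValueError(f"unknown website: {website!r}")
    if job_type not in ALLOWED_JOB_TYPES:
        raise ValueError(f"unknown job_type: {job_type!r}")
    unknown = [c for c in desired_characs if c not in ALLOWED_CHARACS]
    if unknown:
        raise ValueError(f"unknown desired_characs: {unknown}")
    return True


if __name__ == "__main__":
    print(check_query_params("indeed", "internship", ["titles", "summary"]))
```

Calling it just before `find_CHjobs_from()` with the same arguments fails fast on a misspelled option instead of producing an empty or partial result.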

Source

Credit to the original repo and Medium post - see below.


Everything below here is a copy of the original repo README

Original Repo

Scraping jobs from Indeed or CWjobs

The module job_scraper.py enables you to scrape job postings from Indeed.co.uk or CWjobs.co.uk.

Both require the package Beautiful Soup. For CWjobs, the Selenium web driver is also required. These can be installed as follows:

$ pip install beautifulsoup4
$ pip install selenium

To use this module, import the job_scraper.py file and call the function "find_jobs_from()", which takes several arguments. For an explanation and demonstration of the required arguments, see Demo.ipynb.

Terms and conditions

I do not condone scraping data from Indeed or CWjobs in any way. Anyone who wishes to do so should first read their statements on scraping software here and here.

Using the selenium web driver

At present, the default browser is set as Google Chrome. This can be modified within job_scraper.py.

To extract jobs from CWjobs using Selenium, the appropriate driver must be installed. The driver in this repository is for Google Chrome version 81. If required, see this link to download an appropriate driver for your version of Google Chrome, and place it in the same directory as job_scraper.py.

Accompanying blog post

A full description of this code and the process I followed to write it is available here.

License: MIT License

Languages: Jupyter Notebook 87.4%, Python 12.6%