Job Market Analysis
A big data science project that automatically identifies the skills required in job posts.
Features
- Collect job posts from job portals such as Indeed.com and Glassdoor.ca.
- Analyze job posts with open-source NLP and ML libraries such as spaCy and scikit-learn.
- Visualize results with Redash, an open-source business intelligence system.
The project is built on Docker containers, so it is easy to deploy to a distributed production environment for large-scale processing.
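As a rough illustration of the analysis step (and not the project's actual pipeline), the sketch below pulls candidate skill phrases out of a couple of made-up job-post snippets with spaCy; the toy texts and the choice of noun chunks as skill candidates are assumptions for the example.

```python
# Minimal sketch, assuming spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

# Toy job-post snippets; in the project, posts come from the collectors/ spiders.
posts = [
    "Looking for a data scientist with strong Python and machine learning skills.",
    "The data engineer will build ETL pipelines in Python and SQL on AWS.",
]

counts = Counter()
for doc in nlp.pipe(posts):
    for chunk in doc.noun_chunks:  # noun phrases as rough skill candidates
        counts[chunk.lemma_.lower()] += 1

print(counts.most_common(10))
```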
Structure
- collectors/ - Data collection module, structured as a typical Scrapy project. Its submodules contain several web spiders that scrape job posts, company reviews and job interviews from Indeed and Glassdoor (a simplified spider sketch follows this list).
- models/ - Data model module that defines the job and occupation models. It serves as the basis for the other modules and is in charge of storing data in the Postgres database.
- analysis/ - Data analysis module responsible for EDA tasks. Its notebooks submodule contains various Jupyter notebooks for job and occupation analysis.
- ml/ - Data mining module performs occupation scoring and competency scoring.
- datasets/ - Sample datasets of job posts, company reviews and job interviews, plus the O*NET database.
- results/ - This folder contains some intermediate results in CSV format used by other modules.
- ui/ - Data visualization module is responsible for setting up Redash dashboards.
- app/ - Main app module that coordinates the other modules and makes them work together. It is also in charge of setting up the command lines used to run tasks.
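For orientation, here is a simplified, hypothetical spider in the spirit of collectors/; the class name, URL template and CSS selectors are made up for illustration and do not match the repository's actual spiders.

```python
import scrapy
from urllib.parse import urlencode


class IndeedJobSpider(scrapy.Spider):
    """Illustrative spider only; the real spiders in collectors/ differ."""

    name = "indeed_jobs_sketch"

    def __init__(self, search_kw="data scientist", location="Vancouver, BC", *args, **kwargs):
        super().__init__(*args, **kwargs)
        query = urlencode({"q": search_kw, "l": location})
        self.start_urls = [f"https://ca.indeed.com/jobs?{query}"]

    def parse(self, response):
        # Selector paths are placeholders; real job cards need site-specific selectors.
        for card in response.css("div.job_seen_beacon"):
            yield {
                "title": card.css("h2 a::text").get(),
                "company": card.css("span.companyName::text").get(),
                "location": card.css("div.companyLocation::text").get(),
            }
```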
Installation
In the following sections, you will install the libraries and systems needed to run the project on Ubuntu 18.04.
Install Python 3.7
- sudo apt update
- sudo apt install software-properties-common
- sudo add-apt-repository ppa:deadsnakes/ppa
- sudo apt install python3.7
- sudo apt install python3.7-dev
- sudo apt install python3.7-venv
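If the packages installed correctly, the interpreter should report its version:
- python3.7 --version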
Install Docker
- sudo apt-get update
- sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
- curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
- sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
- sudo apt-get install docker-ce docker-ce-cli containerd.io
- sudo usermod -aG docker ${USER}
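Log out and back in so the group change takes effect, then you can optionally verify the installation:
- docker run hello-world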
Install Docker Compose
- sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
- sudo chmod +x /usr/local/bin/docker-compose
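You can optionally check that the binary is available:
- docker-compose --version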
Build from source code
- git clone https://github.com/data-catalysis/job-cruncher.git
- cd job-cruncher
- python3.7 -m venv .venv
- source .venv/bin/activate
- pip install -r requirements.txt
- make init_systems
- make build_systems
- make build_app
Run
- source .venv/bin/activate
- make start_app
Examples
Example 1: Crawl job posts from Indeed with data scientist as the search keyword, collecting at most 50 posts from Vancouver, BC.
python manage.py job crawl indeed --search_kw 'data scientist' --location 'Vancouver, BC' --max_items 50
Example 2: Crawl job posts from Glassdoor with data engineer as the search keyword, collecting at most 100 posts from Toronto, ON.
python manage.py job crawl glassdoor --search_kw 'data engineer' --location 'Toronto, ON' --max_items 100 --query toronto-data-engineer-jobs-SRCH_IL.0,7_IC2281069_KO8,21.htm
Example 3: Analyze job posts to get the top 50 most frequent bigrams.
python manage.py job analyze bigram --n_top 50
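For intuition about what this command computes, a minimal scikit-learn sketch of top-N bigram counting (not the project's actual implementation, and using made-up text) could look like:

```python
# Minimal sketch: count the most frequent bigrams with scikit-learn.
# Requires scikit-learn >= 1.0 for get_feature_names_out().
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "experience with machine learning and data pipelines",
    "build data pipelines and machine learning models in python",
]

vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
counts = vectorizer.fit_transform(posts).sum(axis=0).A1

top_bigrams = sorted(zip(vectorizer.get_feature_names_out(), counts), key=lambda x: -x[1])
print(top_bigrams[:50])
```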
Example 4: Compute the matching occupations for job posts using the top 50 most frequent bigrams.
python manage.py job occupation-score bigram --k 50 --data_table 'job_occupation'
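One plausible way to match posts against occupation descriptions (for example, from the O*NET data in datasets/) is TF-IDF cosine similarity; the sketch below is an illustrative assumption, not the repository's actual scoring logic, and the occupation texts are invented.

```python
# Illustrative occupation-scoring sketch using TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

occupations = {
    "Data Scientists": "apply statistics machine learning and programming to analyze data",
    "Database Architects": "design and implement databases and data models",
}

post = "seeking a data scientist skilled in machine learning statistics and python"

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
matrix = vectorizer.fit_transform(list(occupations.values()) + [post])

occ_vecs = matrix[: len(occupations)]
post_vec = matrix[len(occupations):]
scores = cosine_similarity(post_vec, occ_vecs).ravel()

for name, score in sorted(zip(occupations, scores), key=lambda x: -x[1]):
    print(f"{name}: {score:.3f}")
```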
Web Frontend:
- URL: http://ec2-107-23-250-99.compute-1.amazonaws.com/
- Login credentials: Email: cmpt733@sfu.ca | Password: cmpt733
Public Dashboards: