Parth4786 / Data-_Scientist_-Salary_-Prediction

This is a GitHub repository for “Data Scientist Salary Prediction”. It predicts salaries based on location, experience, education, skills, and industry. It uses machine learning algorithms such as linear regression, decision trees, and random forests. The project is open-source and written in Python.


DSSP Dataset Python 3.6 library

Project Overview

• Created a machine learning model that estimates a data scientist's salary from features such as rating, company_founded, etc.
• Engineered features from the text of each job description to quantify the value companies place on Python, Excel, Tableau, and SQL
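
The text-based skill features described above could be derived roughly as follows. This is a minimal sketch, not the repository's actual code: the column name "Job Description", the sample rows, the helper `mentions`, and the `*_job` output column names are all assumptions made for illustration.

```python
import re

import pandas as pd

# Hypothetical sample rows; the column name "Job Description" is an assumption.
df = pd.DataFrame({
    "Job Description": [
        "Strong Python and SQL skills; Tableau is a plus.",
        "Excellent communication; advanced Excel required.",
        "Experience with R and Spark.",
    ]
})

SKILLS = ["python", "excel", "tableau", "sql"]


def mentions(text: str, skill: str) -> int:
    """Return 1 if `skill` appears as a whole word in `text`, else 0."""
    pattern = rf"\b{re.escape(skill)}\b"
    return int(re.search(pattern, text, flags=re.IGNORECASE) is not None)


# One binary indicator column per skill (hypothetical names: python_job, ...).
for skill in SKILLS:
    df[f"{skill}_job"] = df["Job Description"].apply(
        lambda text, s=skill: mentions(text, s)
    )

print(df[[f"{s}_job" for s in SKILLS]])
```

Matching on word boundaries keeps "Excellent" from being counted as a mention of Excel, which a plain substring check would get wrong.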

How will this project help?

• This project helps data scientists and analysts negotiate their pay for an existing or a new job

Resources Used

• Packages: pandas, numpy, sklearn, matplotlib, seaborn.
• Dataset by Ken Jee: https://github.com/PlayingNumbers/ds_salary_proj

Exploratory Data Analysis (EDA) and Data Cleaning

• Removed unwanted columns: 'Unnamed: 0'
• Plotted bar graphs for numerical features and count plots for categorical features for EDA
• Numerical features (Rating, Founded): replaced NaN or -1 values with the mean or median, depending on their distribution
• Categorical features: replaced NaN or -1 values with an 'Other'/'Unknown' category
• Removed unwanted alphabetic and special characters from the Salary feature
• Converted the Salary column to a single scale, i.e. from (per hour, per annum, employer provided salary) to (per annum)
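
The cleaning steps above might look like the sketch below. It is illustrative only: the column names, the sample salary strings, the helper `parse_salary`, the 2080 hours-per-year conversion, and the choice of mean for Rating and median for Founded are assumptions, not taken from the repository.

```python
import re

import numpy as np
import pandas as pd

HOURS_PER_YEAR = 2080  # assumption: 40 hours/week * 52 weeks


def parse_salary(raw) -> float:
    """Return the midpoint annual salary in thousands of USD, or NaN if missing."""
    if raw is None or str(raw).strip() == "-1":
        return float("nan")
    text = str(raw).lower()
    hourly = "per hour" in text
    # Extract the numeric bounds, ignoring "$", "K" and descriptive text.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    if not numbers:
        return float("nan")
    midpoint = (numbers[0] + numbers[-1]) / 2
    if hourly:
        # Hourly dollars -> annual thousands of dollars.
        midpoint = midpoint * HOURS_PER_YEAR / 1000
    return midpoint


# Hypothetical rows in the style of scraped job listings.
df = pd.DataFrame({
    "Salary Estimate": [
        "$53K-$91K (Glassdoor est.)",
        "Employer Provided Salary:$80K-$90K",
        "$17-$24 Per Hour(Glassdoor est.)",
        "-1",
    ],
    "Rating": [3.5, -1.0, 4.5, 4.0],
    "Founded": [1990, 2010, -1, 2000],
    "Industry": ["IT Services", "-1", "Biotech", "-1"],
})

df["salary"] = df["Salary Estimate"].apply(parse_salary)

# Numerical features: treat -1 as missing, then fill with mean or median.
for col, strategy in {"Rating": "mean", "Founded": "median"}.items():
    values = df[col].replace(-1, np.nan)
    fill = values.mean() if strategy == "mean" else values.median()
    df[col] = values.fillna(fill)

# Categorical features: map the -1 placeholder to an explicit category.
df["Industry"] = df["Industry"].replace("-1", "Unknown")

print(df)
```

In practice the mean-versus-median choice would follow from each column's distribution plot, with the median being the usual pick for skewed features.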

Feature Engineering

• Creating new features from existing features, e.g. job_in_headquaters from (job_location, headquarters), etc.
• Trimming columns, i.e. trimming features that have more than 10 categories, to reduce the dimensionality
• Handling ordinal and nominal categorical features
• Feature Selection using information gain (mutual_info_regression) and correlation matrix
• Feature Scaling using StandardScaler
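
The feature engineering steps above can be sketched on synthetic data as follows. The data, the column names "Location", "Headquarters" and "Sector", the helper `trim_categories`, and the "keep the 10 most frequent categories" rule are assumptions for illustration; only `mutual_info_regression` and `StandardScaler` are named in the project itself.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler


def trim_categories(series: pd.Series, top_n: int = 10, other: str = "Other") -> pd.Series:
    """Keep the `top_n` most frequent categories and map the rest to `other`."""
    keep = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(keep), other)


# Synthetic stand-in for the cleaned dataset (all values are made up).
rng = np.random.default_rng(0)
n = 200
cities = ["New York, NY", "Austin, TX", "Boston, MA"]
df = pd.DataFrame({
    "Location": rng.choice(cities, size=n),
    "Headquarters": rng.choice(cities, size=n),
    "Sector": rng.choice([f"sector_{i}" for i in range(15)], size=n),
    "Rating": rng.uniform(1, 5, size=n),
    "Founded": rng.integers(1900, 2020, size=n),
})
# Target depends on Rating only, so the information-gain ranking is easy to check.
df["salary"] = 20 * df["Rating"] + rng.normal(0, 1, size=n)

# New feature: 1 if the job is located at the company headquarters.
df["job_in_headquaters"] = (df["Location"] == df["Headquarters"]).astype(int)

# Trim a high-cardinality column down to its 10 most frequent categories.
df["Sector"] = trim_categories(df["Sector"], top_n=10)

# Information gain of each numeric feature with respect to the target.
numeric = ["Rating", "Founded", "job_in_headquaters"]
scores = mutual_info_regression(
    df[numeric],
    df["salary"],
    discrete_features=np.array([False, False, True]),
    random_state=0,
)
info_gain = pd.Series(scores, index=numeric).sort_values(ascending=False)
print(info_gain)

# Standardise the continuous features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["Rating", "Founded"]])
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.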

Model Building and Evaluation

Metric: Negative Root Mean Squared Error (NRMSE); scores closer to zero are better
• Multiple Linear Regression: -27.523
• Lasso Regression: -27.993
• Random Forest: -17.637
• Gradient Boosting: -24.429
• Voting (Random Forest + Gradient Boosting): -19.136
Note: Evaluation scores are obtained using cross-validation.
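
A comparison like the one above could be produced along these lines. This is a sketch on synthetic data from `make_regression`, so its scores will not match the numbers listed; the hyperparameters, the 5-fold setting, and the variable names are assumptions rather than the project's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the prepared salary features.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
gb = GradientBoostingRegressor(random_state=0)

models = {
    "Multiple Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=1.0),
    "Random Forest": rf,
    "Gradient Boosting": gb,
    # VotingRegressor averages the predictions of its (cloned) base estimators.
    "Voting (Random Forest + Gradient Boosting)": VotingRegressor(
        estimators=[("rf", rf), ("gb", gb)]
    ),
}

results = {}
for name, model in models.items():
    # scikit-learn maximises scores, so RMSE is reported with a negative sign.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    results[name] = scores.mean()
    print(f"{name}: {results[name]:.3f}")
```

Because the scorer negates the RMSE, the best model is the one with the highest (least negative) mean score.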

Model Prediction

[Figure: Prediction]


