ardbramantyo/MachineLearning-Employee-Attrition

artificial-neural-networks confusion-matrix deep-learning keras-tensorflow logistic-regression machine-learning model-accuracy random-forest

Comparative Study of 3 Different Machine Learning Techniques to Predict Employee Attrition

Data Source: Kaggle

Overview

This project develops three Machine Learning models and compares their predictions on the fictional "IBM HR Analytics Employee Attrition & Performance" dataset (1470 rows) to find which one best predicts employee attrition.

Tools: Pandas, NumPy, Seaborn, Matplotlib, Scikit-Learn, TensorFlow, Keras

Exploratory Data Analysis

Data Cleaning

So that the models do not misinterpret the data, the features (X) are split into 2 groups by data type. The categorical group (X_cat) is converted to numerical form with scikit-learn's one-hot encoder, and both groups are then concatenated back together.

Variables:

Categorical (X_cat): every field with the object data type, excluding Attrition.
Numerical (X_numerical): every field with a numerical data type.

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
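
The full split, encode, and concatenate step described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the four-row table holds made-up values, the 0/1 mapping of Attrition and the name X_all are assumptions, and only the column names come from the IBM dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the HR table (values are illustrative only)
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
    "Department": ["Sales", "Research & Development",
                   "Research & Development", "Sales"],
    "OverTime": ["Yes", "No", "Yes", "Yes"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Target: assumed mapping of "Yes" to 1 and "No" to 0
y = (df["Attrition"] == "Yes").astype(int).to_numpy()

# Split the remaining fields by data type
features = df.drop(columns=["Attrition"])
X_cat = features.select_dtypes(exclude="number")
X_numerical = features.select_dtypes(include="number")

# One-hot encode the categorical fields, then join both groups again
onehotencoder = OneHotEncoder()
X_cat_encoded = onehotencoder.fit_transform(X_cat).toarray()
X_all = np.concatenate([X_cat_encoded, X_numerical.to_numpy()], axis=1)

print(X_all.shape)
```

Each categorical field becomes one column per distinct value, which is why the encoded table ends up wider than the original one.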

Machine Learning Methods Used for This Case:

Logistic Regression
Random Forest
Deep Learning Model

Accuracy Measurement Method:

Accuracy is measured on a held-out test set:

Training: 1102 rows (75%)
Test: 368 rows (25%)
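
A 75/25 split of this size can be reproduced with scikit-learn's train_test_split. The sketch below uses random placeholder data with the same number of rows; the feature count of 50 is taken from the deep learning section, and random_state=0 is an assumption, since the notebook's seed is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1470, 50))
y = rng.integers(0, 2, size=1470)

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape[0], X_test.shape[0])
```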

1. Logistic Regression Model

Logistic regression is best suited to predicting binary outputs, with the two possible values labeled "0" and "1".
The model output is one of two classes: stayed/left, pass/fail, win/lose, etc.
The algorithm first computes a linear combination of the independent predictors, then passes that value through the sigmoid function to obtain a probability between 0 and 1.

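from sklearn.linear_model import LogisticRegression

# Fit on the training split, then predict attrition for the test split
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)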
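
To make the linear-equation step concrete, here is a small sketch of how one prediction is formed. The weights, bias, and input values are hypothetical numbers chosen for illustration; a fitted model learns its own coefficients from the training data.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for two features (illustrative only)
weights = np.array([0.8, -0.5])
bias = 0.1
x = np.array([1.0, 2.0])

# Linear equation first, then the sigmoid turns it into a probability
z = np.dot(weights, x) + bias
probability = sigmoid(z)

# Class "1" when the probability reaches 0.5, otherwise class "0"
predicted_class = int(probability >= 0.5)
print(probability, predicted_class)
```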

2. Random Forest Classifier Model

Decision trees are a supervised Machine Learning technique in which the data is split repeatedly according to a condition on one of the features.
Random Forest Classifier is a type of ensemble algorithm.
It builds a set of decision trees, each from a randomly selected subset of the training set.
It then combines the votes of the individual trees to decide the final class of each test sample.

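from sklearn.ensemble import RandomForestClassifier

# Fit the ensemble of trees, then predict attrition for the test split
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)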
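
The way the trees' outputs are combined can be inspected directly. The sketch below uses synthetic data from make_classification, and n_estimators=25 and random_state=0 are assumptions. Note that scikit-learn's forest averages the class probabilities of its trees rather than counting hard votes; with fully grown trees the two usually agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the HR dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Each fitted tree is exposed through forest.estimators_
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average the per-tree class probabilities and pick the most likely class
mean_proba = tree_probas.mean(axis=0)
combined = forest.classes_[np.argmax(mean_proba, axis=1)]

# The manual combination matches the forest's own prediction
print(np.array_equal(combined, forest.predict(X)))
```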

3. Deep Learning Model

Parameters for training:

Input layer = 50 (from table fields)
Hidden layer = 3 layers (dense, 500 neurons each, relu activation function)
Output = 1 (sigmoid activation function)
Epochs = 100
Batch size = 50
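
One way to express this architecture in Keras is sketched below. The layer sizes and activations follow the list above; the Adam optimizer and binary cross-entropy loss are assumptions, since the README does not name them, and build_model is a hypothetical helper name.

```python
import tensorflow as tf

def build_model(n_features=50):
    """Three 500-unit ReLU hidden layers and one sigmoid output unit."""
    model = tf.keras.models.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(500, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Optimizer and loss are assumed; they are common choices for binary targets
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

With the settings listed above, training would then be a call such as model.fit(X_train, y_train, epochs=100, batch_size=50).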

Deep Learning Performance

Confusion Matrix Comparison

Confusion Matrix: Logistic Regression (left), Random Forest (middle), and Deep Learning (right)

Method	Accuracy (%)
Logistic Regression	89
Random Forest	85
Deep Learning	83
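
For reference, a confusion matrix and its accuracy can be computed with scikit-learn as sketched below. The labels are a tiny made-up example, not the project's predictions, so the numbers do not correspond to the table above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (1 = left, 0 = stayed)
y_test = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0])

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)

# Accuracy is the share of predictions on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()
accuracy = accuracy_score(y_test, y_pred)

print(cm)
print(accuracy)
```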

Conclusion

Of the 3 Machine Learning methods compared, Logistic Regression achieved the highest accuracy (89%) and is therefore the most suitable for predicting employee attrition.
