ada-boost-classifier data-science decision-tree-classifier employee-attrition employee-management gradient-boosting hyperparameter-optimization machine-learning machine-learning-classification random-forest randomsearch-cv scikit-learn sklearn sklearn-classifier xgboost xgboost-classifier

Exploratory Data Analysis and Employee Attrition Prediction

--

About

In business, employee attrition occurs when employees leave a company for any reason, such as taking a new job or retiring, and are not replaced immediately.

To be successful, a company needs not only to attract top talent but also to retain it. For this reason, my job here is to examine a dataset describing a company's employees and look for patterns that may help explain why employees leave.

In this notebook, I first cleaned the data, checking for missing values and assigning descriptive names to categorical variables that had been encoded as numbers.
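A minimal sketch of this cleaning step, assuming pandas. The toy frame, the column names, and the label mapping are all hypothetical and stand in for the real dataset, which is not reproduced here:

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns may differ.
df = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "Education": [2, 1, 4, 3],          # categories stored as numeric codes
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# Replace numeric codes with readable names (assumed mapping).
education_labels = {
    1: "Below College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "Doctor",
}
df["Education"] = df["Education"].map(education_labels)

print(missing_per_column)
print(df)
```

Note that `map` turns any code missing from the dictionary into `NaN`, so it is worth rechecking for missing values after relabeling.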

After that, I used the Plotly library for data visualization, which helped me draw conclusions and find patterns in employee attrition.

To build a classification model, I split the data into training and testing sets and applied the necessary preprocessing, such as variable encoding, feature rescaling, and handling imbalanced classes. Lastly, I searched for the best hyperparameter settings and tested the models again, aiming for better scores.
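The preprocessing described above could look roughly like the following sketch, assuming scikit-learn. The synthetic data and column names are hypothetical. The class imbalance is handled here with `class_weight="balanced"`, which is only one possible approach and not necessarily the one used in the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data with roughly 20% positive (attrition) cases.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "MonthlyIncome": rng.normal(5000, 1500, n),
    "Department": rng.choice(["Sales", "R&D", "HR"], n),
})
y = (rng.random(n) < 0.2).astype(int)

# Stratify so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Rescale numeric features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["MonthlyIncome"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Department"]),
])

# class_weight="balanced" reweights classes inversely to their frequency.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the preprocessing in a `Pipeline` means the scaler and encoder are fitted on the training set only, which avoids leaking information from the test set.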

Conclusion

Through RandomizedSearchCV, we improved the accuracy of the Gradient Boosting Classifier to 88.44%, the highest among all models.
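For readers unfamiliar with this kind of search, here is a small sketch with scikit-learn's `RandomizedSearchCV` and `GradientBoostingClassifier`. The synthetic data, the parameter grid, `n_iter`, and `cv` are illustrative assumptions, not the settings used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the attrition data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Candidate values to sample from (assumed, for illustration only).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
}

# Try 5 random combinations, each scored with 3-fold cross-validation.
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params)
print(f"test accuracy: {test_accuracy:.3f}")
```

Changing `scoring` to `"recall"` would steer the search towards catching more leavers instead of maximizing overall accuracy.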

Yet, the first Ada Boost Classifier model maintained the best recall score, 62.69%, correctly identifying the largest number of employees who were likely to leave while keeping a good accuracy score (87.53%). After tuning Ada Boost, recall rose to 83.58%, but accuracy dropped considerably and false positives increased sharply, possibly indicating that the tuned model became biased towards labeling employees as likely to leave.
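The trade-off between recall and accuracy can be illustrated with a toy example, assuming scikit-learn's metrics. The labels and predictions below are made up and are not the notebook's results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy ground truth: 1 = employee left, 0 = employee stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A cautious model flags few employees; an eager one flags many.
cautious_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
eager_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

results = {}
for name, y_pred in [("cautious", cautious_pred), ("eager", eager_pred)]:
    # ravel() on a binary confusion matrix yields tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "false_positives": int(fp),
    }
    print(name, results[name])
```

In this toy case the eager model finds every leaver (recall 1.0) but raises three false alarms and has lower accuracy (0.7 versus 0.8), the same pattern described above.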

Kaggle

I originally posted this notebook to Kaggle, where the Plotly graphs are interactive. I highly suggest viewing this project on Kaggle (https://www.kaggle.com/code/lusfernandotorres/eda-and-employee-attrition-prediction/notebook), and I'd love it if you left a comment and an upvote.

I appreciate suggestions and recommendations that may help me improve my work :)

Thank you so much!

Author

Luís Fernando Torres

About

Exploratory data analysis and machine learning classification models to predict employee attrition.

https://www.kaggle.com/code/lusfernandotorres/eda-and-employee-attrition-prediction/notebook
