GR8505 / Credit_Risk

Performed supervised machine learning using oversampling, undersampling and combination sampling techniques to determine credit risk for bank customers.


Credit Risk


Executive Summary


In this project, I experimented with the following sampling techniques to predict which banking customers were deemed as high-risk:

Oversampling

  • Naive Random Sampling
  • Synthetic Minority Oversampling Technique (SMOTE)

Undersampling

  • Cluster Centroid Undersampling

Combination of Oversampling and Undersampling

  • Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors (SMOTEENN)

Based on the results of the sampling techniques above, it was difficult to determine which model was best. The combination sampling technique (SMOTEENN) was probably the strongest of the lot, but there was a large disparity between its Precision and Sensitivity scores when predicting High-Risk customers.

This technique was good at flagging High-Risk customers, but the low Precision score of 0.01 indicates that the model falsely categorizes a large number of Low-Risk customers as High-Risk. Therefore, in my opinion, none of the models should be used. While SMOTEENN does a good job of detecting High-Risk customers, it arguably does too good a job, to the extent that a large number of Low-Risk customers would be denied access to loan facilities.

Refer to the following link for the steps taken to clean the data and execute the sampling methods.


Objectives


  • Implement different ML models
  • Use Resampling to address class imbalance
  • Evaluate the performance of ML models
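The evaluation objective above can be sketched with scikit-learn's standard metrics. The labels below are hypothetical, purely to show how Precision and Sensitivity (Recall) are derived from the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = High-Risk (minority), 0 = Low-Risk (majority).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn), a.k.a. Sensitivity
print(precision, recall)
```

A low Precision with a high Recall, as reported for several models below, means the model catches most High-Risk customers but at the cost of many false High-Risk flags.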

Resources


  • Python
  • numpy
  • pandas
  • sklearn
  • imblearn
  • plotly


Ensemble Classifiers


  • Balanced Random Forest Classifier
  • Easy Ensemble AdaBoost Classifier

Neither of these classification methods improved the model. Their perfect Precision and Recall scores for predicting Low-Risk customers are an indication of an overfitted model. Moreover, the low Recall score for High-Risk customers means that both methods are prone to misclassifying a fair number of High-Risk customers as Low-Risk.
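The balanced-forest idea can be approximated without imblearn. The sketch below uses synthetic data and scikit-learn's per-bootstrap class weighting as a rough stand-in for the rebalanced bootstraps that imblearn's BalancedRandomForestClassifier draws; it is an illustration, not the repo's actual model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Hypothetical imbalanced data: roughly 10% High-Risk (class 1).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# class_weight='balanced_subsample' reweights classes inside each bootstrap
# sample, approximating the rebalancing that BalancedRandomForestClassifier
# performs by undersampling each bootstrap instead.
clf = RandomForestClassifier(n_estimators=100,
                             class_weight='balanced_subsample',
                             random_state=1)
clf.fit(X_tr, y_tr)
high_risk_recall = recall_score(y_te, clf.predict(X_te))
print(high_risk_recall)
```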


Oversampling


Naive Random Sampling

Based on this first sampling technique, Precision for High-Risk customers is very low, but Sensitivity is respectable at 0.71. Precision for Low-Risk customers is too good to be true at 1, with a Recall score of 0.58. Overall, the accuracy of this model is good at 0.65.
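Naive random oversampling simply duplicates minority rows until the classes balance, which is what imblearn's RandomOverSampler does under the hood. A minimal numpy sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 90 Low-Risk (0), 10 High-Risk (1).
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Duplicate minority rows at random (with replacement) until
# both classes are the same size.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=90 - 10, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])
print(np.bincount(y_res))  # both classes now have 90 samples
```

Because it only repeats existing rows, this technique adds no new information, which is consistent with the modest scores above.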


SMOTE

In this oversampling technique, the Precision scores for both High-Risk and Low-Risk customers are unchanged. The Sensitivity score for Low-Risk customers improves slightly but declines for High-Risk customers. Accuracy remains at 0.65.
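Unlike naive duplication, SMOTE synthesizes new minority points by interpolating between a minority sample and one of its nearest minority neighbours. A minimal sketch of that core step, on hypothetical minority-class points (imblearn's SMOTE adds bookkeeping around this idea):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))  # hypothetical minority-class points

# For each minority point, pick one of its k nearest minority
# neighbours and interpolate a synthetic point between them.
k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)  # column 0 is the point itself

synthetic = []
for i, neighbours in enumerate(idx[:, 1:]):
    j = rng.choice(neighbours)
    gap = rng.random()  # random position along the line segment
    synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
synthetic = np.array(synthetic)
print(synthetic.shape)  # one synthetic sample per minority point
```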


Undersampling


Cluster Centroid Undersampling

In this undersampling technique, there is no significant improvement in the scores for either High-Risk or Low-Risk customers. Furthermore, the accuracy of the model is worse at 0.54.
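Cluster-centroid undersampling shrinks the majority class to minority size by replacing it with KMeans cluster centroids, which is essentially what imblearn's ClusterCentroids does. A minimal sketch on hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(90, 2))  # hypothetical majority (Low-Risk) points
n_minority = 10                   # assumed minority-class size

# Replace the 90 majority points with the centroids of
# n_minority KMeans clusters, shrinking the class to minority size.
km = KMeans(n_clusters=n_minority, n_init=10, random_state=0).fit(X_maj)
X_maj_res = km.cluster_centers_
print(X_maj_res.shape)
```

Discarding most of the majority data this way can cost accuracy, consistent with the drop to 0.54 above.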


Combination Sampling


SMOTEENN

This combination sampling technique yields the best Sensitivity score for High-Risk customers. However, there is no improvement in the Precision score for detecting them. The small F1 score for High-Risk customers clearly indicates the large disparity between their Precision and Sensitivity scores. Accuracy improved to 0.64.
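SMOTEENN follows SMOTE oversampling with an Edited Nearest Neighbours (ENN) cleaning pass. The SMOTE step is sketched earlier; the sketch below shows only the ENN step, on hypothetical post-SMOTE data with some class overlap:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical post-SMOTE data: equal classes with noisy overlap.
X = np.vstack([rng.normal(0.0, 1, (50, 2)), rng.normal(1.0, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# ENN cleaning: drop any sample whose 3 nearest neighbours
# mostly disagree with its own label.
nn = NearestNeighbors(n_neighbors=4).fit(X)
_, idx = nn.kneighbors(X)  # column 0 is the sample itself
keep = np.array([
    np.sum(y[neigh[1:]] == y[i]) >= 2  # majority of 3 neighbours agree
    for i, neigh in enumerate(idx)
])
X_clean, y_clean = X[keep], y[keep]
print(len(X_clean))  # samples surviving the cleaning pass
```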



