Rl16193 / Credit_Risk_Analysis

Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Therefore, you’ll need to employ different techniques to train and evaluate models with unbalanced classes. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company,

easyensembleclassifier machine-learning oversampling randomforestclassifier supervised-learning undersampling

Credit_Risk_Analysis

Credit risk is an inherently unbalanced classification problem, as good loans easily outnumber risky loans. Training and evaluating models on such data therefore calls for techniques designed for unbalanced classes. Using the credit card credit dataset from LendingClub, a peer-to-peer lending services company, we oversample, undersample, and apply combined over- and undersampling to the data and determine which method provides the best results for logistic regression. Next, we compare two machine learning models that reduce bias, BalancedRandomForestClassifier and EasyEnsembleClassifier, to predict credit risk. Finally, we evaluate the performance of these models and make a written recommendation on whether they should be used to predict credit risk.

Results

Naive Random Oversampling

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.81, respectively.

c. Recall is 0.68 for both high-risk and low-risk loans.

d. The balanced accuracy is 0.6782.
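To make the resampling step concrete, here is a minimal sketch of naive random oversampling written with the standard library only. It is not the notebook's code: the function name random_oversample and the toy loan labels are assumptions made for illustration. Rows of the smaller class are duplicated at random until both classes are the same size.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement to fill the gap for this class.
        for _ in range(target - n):
            X_res.append(rng.choice(rows))
            y_res.append(label)
    return X_res, y_res


# Toy data (made up): 8 low-risk loans and 2 high-risk loans.
X = [[i] for i in range(10)]
y = ["low_risk"] * 8 + ["high_risk"] * 2
X_res, y_res = random_oversample(X, y)
balanced_counts = Counter(y_res)
print(balanced_counts)
```

Only the training split should be resampled this way; the test set keeps its original class balance so that the reported metrics reflect real-world proportions.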

SMOTE Oversampling

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.82, respectively.

c. Recall is lower for high-risk loans (0.59) than for low-risk loans (0.69).

d. The balanced accuracy is 0.6398.
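Unlike naive oversampling, SMOTE does not copy existing rows; it creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea with the standard library only. The function name smote_sketch, the choice of k, and the toy points are assumptions, and this is not the library implementation used in the notebook.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (made up): three high-risk loans with two features.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies.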

Undersampling - Cluster Centroids

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.61, respectively.

c. Recall is higher for high-risk loans (0.57) than for low-risk loans (0.44).

d. The balanced accuracy is 0.5063.
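Cluster-centroid undersampling shrinks the majority class by clustering it and keeping only the cluster centroids. The following is a rough standard-library sketch of that idea, assuming a plain k-means with a fixed number of iterations; the helper names kmeans_centroids and cluster_centroid_undersample and the toy data are hypothetical and do not come from the notebook.

```python
import math
import random


def kmeans_centroids(points, k, n_iter=10, seed=0):
    """Plain k-means; returns k centroids summarising `points`."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cluster_centroid_undersample(X, y, majority_label):
    """Replace the majority class with as many centroids as there are
    minority samples, leaving the minority class untouched."""
    majority = [x for x, lab in zip(X, y) if lab == majority_label]
    minority = [(x, lab) for x, lab in zip(X, y) if lab != majority_label]
    centroids = kmeans_centroids(majority, k=len(minority))
    X_res = centroids + [x for x, _ in minority]
    y_res = [majority_label] * len(centroids) + [lab for _, lab in minority]
    return X_res, y_res


# Toy data (made up): 6 low-risk loans and 2 high-risk loans.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
     [2.0, 9.0], [2.5, 8.5]]
y = ["low_risk"] * 6 + ["high_risk"] * 2
X_res, y_res = cluster_centroid_undersample(X, y, majority_label="low_risk")
print(list(zip(X_res, y_res)))
```

Note that the majority class is then represented by averaged points rather than real loans, so much of the original training data is discarded.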

SMOTEENN - Over and Under Sampling

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.73, respectively.

c. Recall is higher for high-risk loans (0.75) than for low-risk loans (0.58).

d. The balanced accuracy is 0.6613.
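The combined approach first oversamples with SMOTE and then cleans the result with Edited Nearest Neighbours (ENN), which drops samples that disagree with their neighbours. The sketch below illustrates only the ENN cleaning step, using a majority vote among the k nearest neighbours. That voting rule, the function name edited_nearest_neighbours, and the toy data are assumptions for illustration; the library's exact selection rule may differ.

```python
import math
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority label of their
    k nearest neighbours (the sample itself is excluded)."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        majority_label = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if majority_label == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Toy data (made up), standing in for an already oversampled training set:
# the last low-risk point sits inside the high-risk cluster.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.15]]
y = ["low_risk"] * 4 + ["high_risk"] * 3 + ["low_risk"]
X_clean, y_clean = edited_nearest_neighbours(X, y)
print(list(zip(X_clean, y_clean)))
```

In this toy run the low-risk point at 5.15 is removed because all three of its nearest neighbours are high-risk, which is the kind of overlap between classes that the cleaning step is meant to reduce.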

Balanced Random Forest Classifier

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.95, respectively.

c. Recall is lower for high-risk loans (0.67) than for low-risk loans (0.91).

d. The balanced accuracy is 0.7877.

Easy Ensemble AdaBoost Classifier

a. Because the test set contains many more low-risk loans than high-risk loans, precision alone is a misleading measure of how well the model performs.

b. Precision and F1-score for low-risk loans are 1.00 and 0.96, respectively.

c. Recall is slightly lower for high-risk loans (0.90) than for low-risk loans (0.92).

d. The balanced accuracy is 0.9105.
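Both ensemble classifiers share one resampling idea: each ensemble member (a decision tree in a balanced random forest, a boosted learner in an easy ensemble) is fit on its own balanced subset of the training data, and the members' predictions are then combined. The sketch below shows only that subset-drawing step, in simplified form. The function name balanced_subsets and the toy labels are hypothetical, and the libraries' exact sampling details (for example, whether rows are drawn with replacement) are not reproduced here.

```python
import random
from collections import Counter


def balanced_subsets(y, n_estimators, seed=0):
    """Yield, for each ensemble member, row indices forming a balanced
    training subset: every class is randomly undersampled to the size of
    the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    for _ in range(n_estimators):
        subset = []
        for rows in by_class.values():
            subset.extend(rng.sample(rows, n_min))
        yield subset


# Toy labels (made up): 95 low-risk loans and 5 high-risk loans.
y = ["low_risk"] * 95 + ["high_risk"] * 5
subsets = list(balanced_subsets(y, n_estimators=3))
for subset in subsets:
    print(Counter(y[i] for i in subset))
```

Because each member sees a different random draw of the majority class, the ensemble as a whole can still use most of the low-risk loans even though every individual member trains on balanced data.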

Summary

The models that involve undersampling (Cluster Centroids and SMOTEENN) have higher recall for high-risk loans than for low-risk loans, but their balanced accuracy (0.5063 and 0.6613) is below that of naive random oversampling (0.6782) and of both ensemble classifiers.
The best classifier is the Easy Ensemble AdaBoost classifier, with a balanced accuracy of 0.9105.
As stated in the module, ensemble learners provide better results because they are stronger classifiers.
Modifying the dataset may produce better results with logistic regression. An SVM classifier is also worth trying.
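As a sanity check on the reported metrics, the short script below recomputes the low-risk F1-score from precision and recall, and the balanced accuracy as the mean of the two per-class recalls. The numbers are the rounded values listed in the results above, so small gaps from rounding are expected; the helper names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recalls):
    """Unweighted mean of the per-class recalls."""
    return sum(recalls) / len(recalls)


# Rounded values copied from the results above:
# model: (high-risk recall, low-risk recall, low-risk F1, balanced accuracy)
reported = {
    "Naive Random Oversampling": (0.68, 0.68, 0.81, 0.6782),
    "SMOTE": (0.59, 0.69, 0.82, 0.6398),
    "Cluster Centroids": (0.57, 0.44, 0.61, 0.5063),
    "SMOTEENN": (0.75, 0.58, 0.73, 0.6613),
    "Balanced Random Forest": (0.67, 0.91, 0.95, 0.7877),
    "Easy Ensemble AdaBoost": (0.90, 0.92, 0.96, 0.9105),
}

checks = {}
for model, (rec_high, rec_low, f1_low, bal_acc) in reported.items():
    # Low-risk precision is reported as 1 for every model.
    f1_gap = abs(f1_score(1.0, rec_low) - f1_low)
    acc_gap = abs(balanced_accuracy([rec_high, rec_low]) - bal_acc)
    checks[model] = (f1_gap, acc_gap)
    print(f"{model}: F1 gap {f1_gap:.3f}, balanced accuracy gap {acc_gap:.3f}")
```

Balanced accuracy weights both classes equally, which is why it is a more useful headline number here than plain accuracy or low-risk precision.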

Languages

Language: Jupyter Notebook 100.0%