SoumyaAbraham / Credit_Risk_Analysis


Credit_Risk_Analysis

In credit risk data, good loans far outnumber risky ones, so the classes are inherently imbalanced. Therefore, it is important to employ various training and evaluation techniques so that the model can get a good understanding of the data.

DELIVERABLE 1

In this project, we will be using imbalanced-learn and scikit-learn libraries to build and evaluate models using resampling.

We will evaluate three machine learning models and determine which is the best for predicting credit risk. You can find the code for this part of the project here

The steps involved in this analysis are as follows:

Before we start, import all the dependencies for this project.

STEP 1: Transform the data into a usable form which involves:

  • Loading the data
  • Dropping NULL values from columns and rows
  • Converting strings to numerical datatypes
  • Bucketing the target column values into High Risk and Low Risk categories
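The cleanup steps above can be sketched with pandas on a tiny hypothetical frame (the column names `loan_amnt`, `home_ownership`, and `loan_status` are illustrative stand-ins for the real loan columns):

```python
import pandas as pd

# Hypothetical sample standing in for the loan CSV
df = pd.DataFrame({
    "loan_amnt": [1000, 2000, None, 4000],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "loan_status": ["Current", "Late (31-120 days)", "Current", "Default"],
})

# Drop fully-null columns, then any rows that still contain NULLs
df = df.dropna(axis="columns", how="all").dropna()

# Convert string features to numeric dummy columns (target excluded)
df = pd.get_dummies(df, columns=["home_ownership"])

# Bucket the target into low_risk / high_risk
low_risk_statuses = ["Current"]
df["loan_status"] = df["loan_status"].apply(
    lambda s: "low_risk" if s in low_risk_statuses else "high_risk")
print(df["loan_status"].tolist())
```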

transform

STEP 2: Split the data into Training and Testing sets

split
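A minimal sketch of the split, using a toy feature matrix in place of the cleaned loan data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced target standing in for the real data
X = np.arange(20).reshape(10, 2)
y = np.array(["low_risk"] * 9 + ["high_risk"])

# Default split holds out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape, X_test.shape)
```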

Going a little further, we can

  • Check the balance of target values
  • Check the shape of the X training set

balance and shape
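Both checks are one-liners; `collections.Counter` tallies the class balance and `.shape` reports the training matrix dimensions (toy arrays shown here):

```python
from collections import Counter
import numpy as np

# Stand-ins for the real training split
y_train = np.array(["low_risk"] * 9 + ["high_risk"])
X_train = np.zeros((10, 2))

print(Counter(y_train))   # class balance of the target
print(X_train.shape)      # rows x features of the training set
```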

STEP 3: Oversampling: Here you will compare two oversampling algorithms to determine which performs better.

  • Using Naive Random Oversampling

Naive Oversampling

  • Using SMOTE Oversampling

SMOTE Oversampling

STEP 4: Undersampling: Let us use the Cluster Centroids algorithm here.

Undersampling

DELIVERABLE 2

STEP 5: Over and Under Sampling (SMOTEENN)

SMOTEENN

DELIVERABLE 3

Here, we will use imblearn.ensemble's BalancedRandomForestClassifier and EasyEnsembleClassifier to predict credit risk and evaluate each model.

You can find the code for this part of the project here

Before we start, ensure you have installed all the necessary libraries. If not, do a quick pip install imbalanced-learn and pip install -U scikit-learn. Bring in all the dependencies as well.

STEP 1: Much like before, bring in the CSV and clean it up so it can be used for risk analysis and testing.

STEP 2: Split the data into Training and Testing sets

split split2

STEP 3: Ensemble Learners: Here, you will train a Balanced Random Forest Classifier and an Easy Ensemble AdaBoost classifier to see which one gives better results.

  • Balanced Random Forest Classifier:

BRFC

  • List the features sorted in descending order by feature importance

Importance
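The ranking pairs each column name with the fitted model's `feature_importances_` and sorts descending; the column names below are hypothetical stand-ins for the real loan features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(1).normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)       # only feature 0 is informative here
cols = ["loan_amnt", "int_rate", "dti"]  # hypothetical feature names

rf = RandomForestClassifier(random_state=1).fit(X, y)

# Pair importances with names and sort descending
ranked = sorted(zip(rf.feature_importances_, cols), reverse=True)
print(ranked[0][1])  # most important feature
```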

  • Easy Ensemble AdaBoost Classifier

AdaBoost

ANALYSIS

Let us compare the various results:

Naive Oversampling Results

NOS

SMOTE Results

SMOTE

Cluster Centroid Results

Cluster

SMOTEENN Results

SMOTEENN

Easy Ensemble Results

ADABOOST

From these results, we notice very low precision for the High Risk class across the resampling models. This indicates a large number of False Positives.

When we look at the EasyEnsembleClassifier model, we notice it has higher scores for High Risk loans, and its balanced accuracy is much higher than that of any other model.

Therefore, it can be concluded that the EasyEnsembleClassifier model is the most effective of them all.
