Loan Default Prediction

Dataset: https://www.kaggle.com/datasets/yasserh/loan-default-dataset/data

This project uses a loan default dataset. The aim is to tune a model that predicts whether a loan will default. The main challenge is that the dataset is highly imbalanced, so traditional methods applied to the raw data do not work very well, and accuracy cannot be the only metric. In addition, false negatives (actual defaulters predicted as non-default) matter more than false positives (non-defaulters predicted as defaulters).
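
To make the point about accuracy concrete, here is a minimal sketch in plain Python. The labels and scores are made-up toy values, not taken from the Kaggle dataset, and the helper names (confusion_counts, accuracy, recall) are hypothetical. It assumes label 1 means "default" and that defaults are the rare class.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    """Return (tp, fp, fn, tn); a score >= threshold is predicted as default (1)."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fp, fn, tn):
    # Share of actual defaulters that were caught; 0.0 if there are none.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data (assumption): 18 non-defaults followed by 2 defaults.
y_true = [0] * 18 + [1] * 2

# A "model" that never predicts default still looks accurate.
always_non_default = [0.0] * 20
counts = confusion_counts(y_true, always_non_default)  # (0, 0, 2, 18)
print("accuracy:", accuracy(*counts), "recall:", recall(*counts))

# Hypothetical predicted probabilities from some classifier.
scores = [0.1] * 16 + [0.4] * 2 + [0.35, 0.6]
at_default = confusion_counts(y_true, scores, threshold=0.5)  # (1, 0, 1, 18)
at_lowered = confusion_counts(y_true, scores, threshold=0.3)  # (2, 2, 0, 16)
print("threshold 0.5:", at_default, "recall:", recall(*at_default))
print("threshold 0.3:", at_lowered, "recall:", recall(*at_lowered))
```

The always-negative model reaches high accuracy while missing every defaulter. Lowering the threshold in the second example trades two extra false positives for one fewer false negative, which is the trade-off this project prefers.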

We try to predict loan default using machine learning techniques. The steps we took were:

Clean the dataset
  1. Remove N/As
  2. Remove redundant variables
  3. Remove non-informative variables (IDs, Year)
Split the dataset into training and test sets
Apply SMOTE to oversample the minority class
OLS, plus LASSO and Ridge for regularization
KNN
Random Forest
Random Forest and LASSO
AdaBoost
Gradient Boosting
XGBoost
XGBoost with adjusted thresholds
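
For readers unfamiliar with the SMOTE step, the core idea can be sketched in plain Python as below. This is an illustration only and is not taken from the notebooks, which may well use a library implementation such as the one in imbalanced-learn. The function name smote_sketch, the parameter names, and the toy points are all assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    minority: list of feature vectors (lists of floats), at least 2 of them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between x and the neighbour.
        u = rng.random()
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy minority-class points (made up for illustration).
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points), "synthetic points created")
```

Oversampling of this kind is normally applied to the training split only, after the train/test split, so that the test set keeps the original class balance.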

For these models, we also tuned the hyperparameters by cross-validation (such as lambda in LASSO and the number of trees in Random Forest).
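
Choosing a hyperparameter by cross-validation can be sketched as follows. To stay dependency-free, this example swaps in a one-feature ridge regression without intercept (closed-form solution) in place of LASSO or Random Forest; the data, the lambda grid, and all function names are assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x with penalty lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=5):
    """Mean validation MSE over k folds for a given lambda."""
    errors = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        mse = sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx)
        errors.append(mse)
    return sum(errors) / len(errors)


# Toy data (made up): roughly y = 2x with some noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.3, 17.8, 20.1]

grid = [0.0, 0.1, 1.0, 10.0]  # candidate lambda values (assumption)
best_lam = min(grid, key=lambda lam: cv_mse(xs, ys, lam))
print("selected lambda:", best_lam)
```

The same loop applies to any model and hyperparameter: fit on the training folds, score on the held-out fold, and keep the value with the best average score. On real, imbalanced data the folds would usually be shuffled and stratified by class rather than contiguous.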

MIT License
