This project is an implementation of a probability of default model, which is used to predict the likelihood that a borrower will default on a loan. The model is trained on a dataset of historical loan data and can be used to make predictions on new loan applications. The Logistic model was created using fine and coarse classing utilizing weight of evidence and information values.
The model was evaluated using the ROC curve, Gini coefficient, and Kolmogorov-Smirnov test. The final output is an easy-to-use scorecard that ranges from 300 to 850.
- Data
- Preprocessing
- Feature Engineering
- Model Training
- Model Evaluation
- Scorecard
- Contributing
- License
The dataset used for this project is sourced from LendingClub and consists of historical data on loans issued by the platform. The data includes information such as loan amount, interest rate, borrower income, and credit score, as well as whether the loan was fully paid off or defaulted.
The dataset can be found here: Dataset
Before training the model, the data is preprocessed to ensure that it is in the correct format and that any missing values are handled appropriately. The preprocessing steps include:
- Dropping irrelevant columns
- Handling missing values
- Converting categorical features into numerical features using one-hot encoding
- Scaling numerical features
The preprocessing code can be found in the preprocessing.py
file.
In addition to preprocessing, we also perform feature engineering to create new features that may be more informative than the original features. The feature engineering steps include:
- Creating new features based on the original features (e.g., debt-to-income ratio)
- Removing redundant features
- Using domain knowledge to create new features (e.g., calculating the number of delinquent accounts)
The feature engineering code can be found in the PD_Model.ipynb
file.
The model is trained using a logistic regression algorithm, which is a popular algorithm for binary classification problems such as this one. The training code can be found in the PD_Model.ipynb
file.
The trained model is evaluated on the testing set to assess its performance. The evaluation metrics used include the ROC curve, Gini coefficient, and Kolmogorov-Smirnov test. The evaluation code and results can be found in the PD_Model.ipynb
file.
The final output is an easy-to-use scorecard that ranges from 300 to 850. Fine and coarse classing with weight of evidence and information values were used to create the binning for the variables on the scorecard.
If you would like to contribute to this project, feel free to submit a pull request or open an issue.
This project is licensed under the MIT License.