kenuiuc / dsci_522_group_20

backup repo for dsci 522 group project

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Credit Card Default Predictor

  • authors: Arjun Radhakrishnan, Morris Zhao, Fujie Sun, Ken Wang
  • contributors: TBD

Introduction

Research Question

For this project we are trying to answer the question:

Given a credit card customer's payment history and demographic information like gender, age, and education level, would the customer default on the next bill payment?"

Answering this question is important because, with an effective predictive model, financial institutions can evaluate a customer's credit level and grant appropriate credit amount limits. This analysis would be crucial in credit score calculation and risk management.

Data Set

The dataset this project uses is the default of credit card payment by clients, created by I-Cheng Yeh from (1) Department of Information Management, Chung Hua University, Taiwan. and (2) Department of Civil Engineering, Tamkang University, Taiwan. The data file is available on UCI Machine Learning Repository. Each row of this data includes:

  • Default payment (Yes = 1, No = 0), as the response variable
  • Monthly bill statements in the last 6 months
  • Monthly payment status in the last 6 months(on time, 1 month delay, 2 month delay, etc)
  • Monthly payment amount in the last 6 months
  • Credit amount
  • Gender
  • Education
  • Martial status
  • Age

Analysis Approach

Given that this is a binary classification problem and we have both categorical and continuous numeric features, we plan to build different models including logistic regression, support vector classifier, kNN classifier, and random forest. We carried out cross-validation for each model, optimize their hyper-parameters and compare their performance using multiple evaluation metrics. Given that the sample data is imbalanced with about a 20% default rate, accuracy might not be a good enough scoring method to use. We included other metrics like precision/recall, and f1-score.

Initial EDA

So far we have performed some basic exploratory data analysis which can be found here. The main observations are:

  • Target data is imbalanced so we need extra efforts in choosing appropriate evaluation metrics and applying class-weight to our models.

  • There are strong correlations among multiple numeric features (April payment amount and May payment amount for example), which indicates we might have to drop some of those features to improve model performance.

Report

We present the analysis report using a markdown file. The final report can be found here

Usage

To reproduce this analysis you will need to:

  • Clone this Github repo:
git clone git@github.com:UBC-MDS/credit_default_prediction_group_20.git
  • Install the dependencies listed below

  • Create the environment

conda env create -f environment.yaml
  • To activate the created environment, execute the below command:
    conda activate credit_default_predict
  • Download the data: Run the python script to download the data from UCI ML repository:
python ./src/download_data_from_url.py --url "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls" --download_path "./data/raw" --file_name "credit_default_data" --file_type "xlsx"
  • Run the following commands in the same order to preprocess the data, gerenerate EDA artifacts, ML artifacts, and finally final report.
# preprocessing
python ./src/data_processing.py --input_path='data/raw/credit_default_data.csv' --out_dir='data/processed' [--test_size=<test_size>]

# EDA
python ./src/eda_script.py --processed_data_path='data/processed' --eda_result_path='results/eda'

# tune models
python ./src/fit_credit_default_predict_models.py --read_training_path='data/processed/credit_train_df.csv'[--write_model_path='results/trained_models'] [--write_score_path='results/cross_validation_reuslts.csv']

# test models
python ./src/model_summary.py --model_dir='results/trained _models'  --test_data='data/processed/credit_test_df.csv' --output_dir='results/model_summary'

# render final report
Rscript -e "rmarkdown::render('doc/credit_default_analysis_report.Rmd', output_format = 'github_document')"

Dependencies

For overall list of dependencies, please check here

  • Python 3.10.6 and Python packages:
    • xlrd>=2.0.1
    • xlwt>=1.3.0
    • ipykernel>=6.16.0
    • mglearn>=0.1.9
    • altair_saver>=0.5.0
    • vega_datasets>=0.9.0
    • docopt=0.6.2
    • ipython>=7.15
    • selenium<4.3.0
    • matplotlib>=3.2.2
    • scikit-learn>=1.0
    • pandas>=1.3.*
    • requests>=2.24.0
    • joblib==1.1.0
    • psutil>=5.7.2
    • openpyxl>=3.0.0

License

The Credit Card Default Predictor materials here are licensed under MIT License. If re-using/re-mixing please provide attribution and link to this webpage.

Reference

Credit Card Default Data from UCI ML Repository. Name: I-Cheng Yeh email addresses: (1) icyeh '@' chu.edu.tw (2) 140910 '@' mail.tku.edu.tw institutions: (1) Department of Information Management, Chung Hua University, Taiwan. (2) Department of Civil Engineering, Tamkang University, Taiwan. other contact information: 886-2-26215656 ext. 3181

This readme and project proposal file follows the format of the README.md in the Breast Cancer Predictor project.

About

backup repo for dsci 522 group project

License:MIT License


Languages

Language:Jupyter Notebook 75.9%Language:Python 23.4%Language:Makefile 0.7%