jphall663 / Hello-world

My first repository on GitHub.

Credit Line Increase Model Card

Basic Information

  • Person or organization developing model: Agnes, agnes@gmc.com

  • Model date: August, 2022

  • Model version: 1.0.2

  • License: MIT

  • Model implementation code: DNSC_6301_Project.ipynb

Intended Use

  • Primary intended uses: This model is an example probability of default classifier, with an example use case for determining eligibility for a credit line increase.
  • Primary intended users: Students in GWU DNSC 6301 bootcamp.
  • Out-of-scope use cases: Any use beyond an educational example is out-of-scope.

Training Data

  • Data dictionary:
| Name | Modeling Role | Measurement Level | Description |
|---|---|---|---|
| ID | ID | int | unique row identifier |
| LIMIT_BAL | input | float | amount of previously awarded credit |
| SEX | demographic information | int | 1 = male; 2 = female |
| RACE | demographic information | int | 1 = hispanic; 2 = black; 3 = white; 4 = asian |
| EDUCATION | demographic information | int | 1 = graduate school; 2 = university; 3 = high school; 4 = others |
| MARRIAGE | demographic information | int | 1 = married; 2 = single; 3 = others |
| AGE | demographic information | int | age in years |
| PAY_0, PAY_2 - PAY_6 | inputs | int | history of past payment; PAY_0 = the repayment status in September, 2005; PAY_2 = the repayment status in August, 2005; ...; PAY_6 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above |
| BILL_AMT1 - BILL_AMT6 | inputs | float | amount of bill statement; BILL_AMT1 = amount of bill statement in September, 2005; BILL_AMT2 = amount of bill statement in August, 2005; ...; BILL_AMT6 = amount of bill statement in April, 2005 |
| PAY_AMT1 - PAY_AMT6 | inputs | float | amount of previous payment; PAY_AMT1 = amount paid in September, 2005; PAY_AMT2 = amount paid in August, 2005; ...; PAY_AMT6 = amount paid in April, 2005 |
| DELINQ_NEXT | target | int | whether a customer's next payment is delinquent (late), 1 = late; 0 = on-time |
  • Source of training data: GWU Blackboard, email jphall@gwu.edu for more information
  • How training data was divided into training and validation data: 50% training, 25% validation, 25% test
  • Number of rows in training and validation data:
    • Training rows: 15,000
    • Validation rows: 7,500
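
A 50% / 25% / 25% partition like the one described above could be produced along these lines. This is only a minimal sketch: the function name `split_rows`, the shuffling approach, and the seed are assumptions for illustration, not the code in the project notebook.

```python
import random

def split_rows(n_rows, train_frac=0.50, valid_frac=0.25, seed=12345):
    """Shuffle row indices and cut them into train/validation/test partitions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(n_rows * train_frac)
    n_valid = int(n_rows * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever is left over becomes the test set
    return train, valid, test

# 30,000 = 15,000 training + 7,500 validation + 7,500 test rows
train_idx, valid_idx, test_idx = split_rows(30_000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Keeping the three partitions disjoint is what makes the validation and test metrics honest estimates of performance on unseen rows.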

Test Data

  • Source of test data: GWU Blackboard, email jphall@gwu.edu for more information
  • Number of rows in test data: 7,500
  • State any differences in columns between training and test data: None

Model details

  • Columns used as inputs in the final model: 'LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6'
  • Column(s) used as target(s) in the final model: 'DELINQ_NEXT'
  • Type of model: Decision Tree
  • Software used to implement the model: Python, scikit-learn
  • Version of the modeling software: Python 3.7.13, scikit-learn 1.0.2
  • Hyperparameters or other settings of your model:
DecisionTreeClassifier  {'ccp_alpha': 0.0,'class_weight': None,'criterion': 'gini',
                         'max_depth': 12,'max_features': None,'max_leaf_nodes': None,
                         'min_impurity_decrease': 0.0,'min_samples_leaf': 1,'min_samples_split': 2,
                         'min_weight_fraction_leaf': 0.0,'random_state': 12345,'splitter': 'best'}
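
For readers who want to recreate a tree with these settings, the sketch below passes the listed hyperparameters to scikit-learn's `DecisionTreeClassifier`. The toy data (`X_toy`, `y_toy`) is a hypothetical stand-in so the snippet runs on its own; the real model is trained on the 19 input columns listed above.

```python
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters copied from the model card.
params = {
    "ccp_alpha": 0.0, "class_weight": None, "criterion": "gini",
    "max_depth": 12, "max_features": None, "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0, "min_samples_leaf": 1, "min_samples_split": 2,
    "min_weight_fraction_leaf": 0.0, "random_state": 12345, "splitter": "best",
}
model = DecisionTreeClassifier(**params)

# Hypothetical toy data: two features and a binary target (1 = late, 0 = on-time).
X_toy = [[0, 1], [1, 0], [2, 2], [3, 1], [4, 0], [5, 3]]
y_toy = [0, 0, 0, 1, 1, 1]
model.fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1 (late).
prob_late = model.predict_proba(X_toy)[:, 1]
print(prob_late)
```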

Quantitative Analysis

Correlation Heatmap

image

Lighter colors = positively correlated

Darker colors = negatively correlated, so when one variable goes up the other one goes down

For example, there is a correlation between RACE and the outcome. This means people in certain race groups are approved less often than people in other groups, which is a problem to investigate and fix.

The heatmap also shows strong correlation between some variables.

  • Metrics used to evaluate the final model: AUC and AIR, supported by confusion matrices for each demographic group

confusion matrix across race groups

Confusion matrix by RACE=1
             actual: 1 actual: 0
predicted: 1       447       387
predicted: 0       139       501
(Hispanic)

Confusion matrix by RACE=2
             actual: 1 actual: 0
predicted: 1       449       348
predicted: 0       157       537
(Black)


Confusion matrix by RACE=3
             actual: 1 actual: 0
predicted: 1       176       813
predicted: 0        72      1228
(White)


Confusion matrix by RACE=4
             actual: 1 actual: 0
predicted: 1       186       784
predicted: 0        59      1217
(Asian)


White proportion accepted: 0.568
Hispanic proportion accepted: 0.434
hispanic-to-white AIR: 0.76

White proportion accepted: 0.568
Black proportion accepted: 0.465
black-to-white AIR: 0.82

White proportion accepted: 0.568
Asian proportion accepted: 0.568
asian-to-white AIR: 1.00
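
The proportions and adverse impact ratios (AIR) above can be recomputed from the confusion matrices. The sketch below assumes that "accepted" means the model predicted 0 (not delinquent), which is consistent with the reported numbers. The helper names `proportion_accepted` and `air` are illustrative, not taken from the notebook.

```python
def proportion_accepted(cm):
    """cm maps (predicted, actual) -> count; 'accepted' is assumed to mean predicted 0."""
    accepted = cm[(0, 1)] + cm[(0, 0)]
    return accepted / sum(cm.values())

def air(protected_cm, reference_cm):
    """Adverse impact ratio: protected acceptance rate over reference acceptance rate."""
    return proportion_accepted(protected_cm) / proportion_accepted(reference_cm)

# Counts copied from the confusion matrices by RACE above.
race_cms = {
    "hispanic": {(1, 1): 447, (1, 0): 387, (0, 1): 139, (0, 0): 501},
    "black":    {(1, 1): 449, (1, 0): 348, (0, 1): 157, (0, 0): 537},
    "white":    {(1, 1): 176, (1, 0): 813, (0, 1): 72,  (0, 0): 1228},
    "asian":    {(1, 1): 186, (1, 0): 784, (0, 1): 59,  (0, 0): 1217},
}

for group in ("hispanic", "black", "asian"):
    ratio = air(race_cms[group], race_cms["white"])
    print(f"{group}-to-white AIR: {ratio:.2f}")
```

The same two functions apply unchanged to the SEX, EDUCATION, and MARRIAGE matrices by swapping in the relevant counts and reference group.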

confusion matrix across sex groups

Confusion matrix by SEX=1
             actual: 1 actual: 0
predicted: 1       546       905
predicted: 0       179      1292
(Male)


Confusion matrix by SEX=2
             actual: 1 actual: 0
predicted: 1       712      1427
predicted: 0       248      2191
(Female)


Male proportion accepted: 0.503
Female proportion accepted: 0.533
female-to-male AIR: 1.06

confusion matrix across education groups

Confusion matrix by EDUCATION=1
             actual: 1 actual: 0
predicted: 1       367       766
predicted: 0       144      1359
(Graduate School)

Confusion matrix by EDUCATION=2
             actual: 1 actual: 0
predicted: 1       640      1115
predicted: 0       216      1551
(University)

Confusion matrix by EDUCATION=3
             actual: 1 actual: 0
predicted: 1       249       409
predicted: 0        65       496
(High School)

Confusion matrix by EDUCATION=4
             actual: 1 actual: 0
predicted: 1         0         9
predicted: 0         0        19
(Others)

Graduate School proportion accepted: 0.570
University proportion accepted: 0.502
university-to-graduate school AIR: 0.88

Graduate School proportion accepted: 0.570
High School proportion accepted: 0.460
high school-to-graduate school AIR: 0.81

Graduate School proportion accepted: 0.570
Others proportion accepted: 0.679
others-to-graduate school AIR: 1.19

confusion matrix across marriage groups

Confusion matrix by MARRIAGE=1
             actual: 1 actual: 0
predicted: 1       593      1004
predicted: 0       208      1573
(Married)

Confusion matrix by MARRIAGE=2
             actual: 1 actual: 0
predicted: 1       647      1293
predicted: 0       213      1878
(Single)

Confusion matrix by MARRIAGE=3
             actual: 1 actual: 0
predicted: 1        17        30
predicted: 0         6        29
(Others)

Married proportion accepted: 0.527
Married proportion accepted: 0.527
married-to-married AIR: 1.00

Married proportion accepted: 0.527
Single proportion accepted: 0.519
single-to-married AIR: 0.98

Married proportion accepted: 0.527
Others proportion accepted: 0.427
others-to-married AIR: 0.81

confusion matrix across age groups

Confusion matrix by AGE=40
             actual: 1 actual: 0
predicted: 1        39        59
predicted: 0        17       111
(Age In Years)
  • State the final values, neatly -- as bullets or a table, of the metrics for all data: training, validation, and test data
| Tree Depth | Training AUC | Validation AUC | Test AUC | 5-Fold SD | Hispanic-to-White AIR |
|---|---|---|---|---|---|
| 1 | 0.645748 | 0.643880 | 0.639065 | 0.009275 | 0.894148 |
| 2 | 0.699912 | 0.687752 | 0.685590 | 0.012626 | 0.850871 |
| 3 | 0.742968 | 0.729490 | 0.728666 | 0.017375 | 0.799546 |
| 4 | 0.757178 | 0.741696 | 0.737322 | 0.017079 | 0.792435 |
| 5 | 0.769331 | 0.742480 | 0.739600 | 0.019886 | 0.829336 |
| 6 | 0.783722 | 0.749610 | 0.743847 | 0.017665 | 0.833205 |
| 7 | 0.795777 | 0.742115 | 0.737266 | 0.022466 | 0.835886 |
| 8 | 0.807291 | 0.739990 | 0.734446 | 0.015567 | 0.811300 |
| 9 | 0.822913 | 0.727224 | 0.728575 | 0.012042 | 0.811561 |
| 10 | 0.838052 | 0.720562 | 0.714933 | 0.013855 | 0.803621 |
| 11 | 0.855168 | 0.709864 | 0.702163 | 0.010405 | 0.837806 |
| 12 | 0.874251 | 0.688074 | 0.682614 | 0.00807 | 0.844889 |
  • Plots related to the data or final model

plot tree depth vs. training and validation AUC and AIR

image
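
The depth-by-depth metrics can also be scanned in code to shortlist a tree depth. The sketch below shows one possible selection rule, not necessarily the one used for the final model: keep only depths whose Hispanic-to-White AIR is at least 0.8, then take the highest validation AUC. The 0.8 floor and the names `AIR_FLOOR` and `pick_depth` are assumptions for illustration.

```python
# (tree depth, validation AUC, Hispanic-to-White AIR) copied from the metrics table.
rows = [
    (1, 0.643880, 0.894148), (2, 0.687752, 0.850871), (3, 0.729490, 0.799546),
    (4, 0.741696, 0.792435), (5, 0.742480, 0.829336), (6, 0.749610, 0.833205),
    (7, 0.742115, 0.835886), (8, 0.739990, 0.811300), (9, 0.727224, 0.811561),
    (10, 0.720562, 0.803621), (11, 0.709864, 0.837806), (12, 0.688074, 0.844889),
]

AIR_FLOOR = 0.8  # assumed fairness threshold, not stated in the model card

def pick_depth(rows, air_floor=AIR_FLOOR):
    """Return the row with the best validation AUC among rows meeting the AIR floor."""
    eligible = [r for r in rows if r[2] >= air_floor]
    return max(eligible, key=lambda r: r[1])

best_depth, best_auc, best_air = pick_depth(rows)
print(best_depth, best_auc, best_air)
```

Training AUC keeps rising with depth while validation AUC peaks and then falls. That pattern is the usual sign of overfitting, which is why the rule looks at validation AUC rather than training AUC.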

Ethical considerations

  • Describe potential negative impacts of using your model:

    • Math or software problems: roughly 70% accuracy, which means about 30% of predictions are wrong

    • Real-world risks: who, what, when or how: bias; the model accepts some demographic groups less often than others

  • Describe potential uncertainties relating to the impacts of using your model:

    • Math or software problems: ongoing monitoring is needed because we don't know how the model will behave in use
    • Real-world risks: who, what, when or how: data privacy and security
  • Describe any unexpected results: the data had no missing values, and PAY_0 was disproportionately important
