Case Study | Default Prediction

1. Problem definition

By and large, the case study asks that we predict the probability that a given user will default on their next payment. Although it would have been nice to work with customers' raw time series data, we are instead provided with pre-computed variables that describe such series.

2. Metrics for our business goal

In order to guarantee a smooth experience for our customers, we score and evaluate our models via threshold analysis. That is, the model outputs predictions for the likelihood of customers defaulting, and we follow up with an ad-hoc selection of thresholds to decide whether to flag observations (a sketch of this analysis follows the list below). Such thresholds are selected based on two key performance indicators:

  1. Incorrectly flag no more than 5% of customers as "at risk of default".
  2. Maximize the number of clients correctly flagged as "at risk of default".
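
For illustration, a minimal sketch of this threshold analysis is given below, assuming NumPy arrays of true labels and predicted risks; the function and column names are placeholders of our own and not necessarily those used in the notebooks.

import numpy as np
import pandas as pd

def threshold_report(y_true, predicted_risk, thresholds=np.arange(0.5, 1.0, 0.05)):
    """For each candidate threshold, report the share of good customers flagged
    incorrectly (KPI 1) and the share of defaults caught (KPI 2)."""
    rows = []
    for t in thresholds:
        flagged = predicted_risk > t
        rows.append({
            "threshold": t,
            "pct_good_customers_flagged": np.mean(flagged[y_true == 0]),  # KPI 1: keep below 5%
            "pct_defaults_caught": np.mean(flagged[y_true == 1]),         # KPI 2: maximize
        })
    return pd.DataFrame(rows)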

3. Experiment Overview

  1. Sanity Check - check consistency, missing values and macro behaviour
  2. Exploration - test hypotheses and select variables
  3. Feature Engineering - create the preprocessor module (see the sketch below)
  4. Baseline Model - score a simple baseline model
  5. Tuned Model - search hyperparameters and models
  6. Results and Evaluation - score the best model and present results

Each part of the experiment has its own dedicated folder for the sake of clarity.
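
As a rough idea of what step 3 produces, the sketch below assumes a scikit-learn preprocessor; the column groups and transformers are illustrative guesses of our own, not the exact ones chosen during exploration.

# Illustrative sketch only: the real preprocessor module may differ.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups; the actual selection comes out of the exploration step.
numeric_cols = ["age", "avg_payment_span_0_12m", "sum_paid_inv_0_12m"]
categorical_cols = ["merchant_category", "merchant_group", "name_in_email"]

preprocessor = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("categorical", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])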

4. Results

Throughout our experiment, we show that testing hypotheses and thinking carefully about your data are fundamental steps for having any chance at success in a prediction task. This was most clearly highlighted by our custom undersampling strategy for handling the class imbalance in our target label, which lifted our baseline performance to a much better level.
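
As an illustration only, a custom undersampling of the majority class could look like the sketch below; the sampling ratio, target column name and shuffling are assumptions for the example rather than the exact strategy used in the notebooks.

import pandas as pd

def undersample_majority(df, target="default", ratio=1.0, seed=42):
    """Keep all defaulters and sample the non-defaulters down to roughly
    `ratio` non-defaults per default."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0]
    n_keep = min(len(majority), int(len(minority) * ratio))
    majority_sampled = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so the classes are not ordered in the resulting frame.
    return pd.concat([minority, majority_sampled]).sample(frac=1, random_state=seed)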

In the end, we fit a model that respects our 1st KPI, incorrectly blocking only 2.3% of users while identifying nearly 30% of defaults correctly (at predicted_risk > 0.85). While these results are far from great, we believe they achieve the goal of having a reasonable enough model built over (a long) weekend. Of course, much more effort could go towards hyperparameter tuning to squeeze out any potential gains and towards trying different models, but we believe exploration to be more relevant in the context of a case study, so more time was spent on it.

All in all, we believe this is a fun project with some challenges regarding data quality that encourage creative solutions. Not all steps are perfect, but major experiment decisions were given a good amount of thought considering the short time span of this exercise. We (I, myself and my coffee mug) would be happy to discuss both technical and philosophical details of our methods and implementation.

5. Deployment

We serve our model through AWS Lambda; it is available at:

https://s1rgig9qnh.execute-api.eu-west-2.amazonaws.com/dev/api/v1/default_risk/predict

The expected payload is a JSON object containing at least the following items:

{
    "headers": {
        "Authorization": "otto-case-study"
    },
    "data": """'[{"uuid":"1229c83c-6338-4c4b-a20f-065ecca45b4a",
                  "account_amount_added_12_24m":28472,
                  "account_days_in_dc_12_24m":0.0,
                  "account_days_in_rem_12_24m":0.0,
                  "account_days_in_term_12_24m":0.0,
                  "account_incoming_debt_vs_paid_0_24m":0.0,
                  "account_status":1.0,
                  "account_worst_status_0_3m":1.0,
                  "account_worst_status_12_24m":1.0,
                  "account_worst_status_3_6m":1.0,
                  "account_worst_status_6_12m":1.0,
                  "age":29,
                  "avg_payment_span_0_12m":8.24,
                  "avg_payment_span_0_3m":7.8333333333,
                  "merchant_category":"Diversified electronics",
                  "merchant_group":"Electronics",
                  "has_paid":true,
                  "max_paid_inv_0_12m":37770.0,
                  "max_paid_inv_0_24m":37770.0,
                  "name_in_email":"F1+L",
                  "num_active_div_by_paid_inv_0_12m":0.037037037,
                  "num_active_inv":1,
                  "num_arch_dc_0_12m":0,
                  "num_arch_dc_12_24m":0,
                  "num_arch_ok_0_12m":25,
                  "num_arch_ok_12_24m":16,
                  "num_arch_rem_0_12m":0,
                  "num_arch_written_off_0_12m":0.0,
                  "num_arch_written_off_12_24m":0.0,
                  "num_unpaid_bills":1,
                  "status_last_archived_0_24m":1,
                  "status_2nd_last_archived_0_24m":1,
                  "status_3rd_last_archived_0_24m":1,
                  "status_max_archived_0_6_months":1,
                  "status_max_archived_0_12_months":1,
                  "status_max_archived_0_24_months":1,
                  "recovery_debt":0,
                  "sum_capital_paid_account_0_12m":116,
                  "sum_capital_paid_account_12_24m":27874,
                  "sum_paid_inv_0_12m":265347,
                  "time_hours":14.1708333333,
                  "worst_status_active_inv":1.0}]'"""
}

It's quite easy to generate the expected data format with Pandas. All it takes is:

import pandas as pd

pd.read_csv("dataset.csv").to_json(orient="records")

The route is capable of handling multiple records at once or one at a time. The output is a string in the same JSON format as the input and can easily be transformed back into a DataFrame:

pd.read_json(response.content, orient="records")
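
Putting it all together, a full round trip against the endpoint could look like the sketch below, using the requests library; note that the exact wrapping of the "data" field is our assumption based on the payload example above.

import pandas as pd
import requests

URL = ("https://s1rgig9qnh.execute-api.eu-west-2.amazonaws.com"
       "/dev/api/v1/default_risk/predict")

# Build the record-oriented JSON payload from the raw CSV, as shown above.
records_json = pd.read_csv("dataset.csv").to_json(orient="records")

response = requests.post(
    URL,
    json={
        "headers": {"Authorization": "otto-case-study"},
        "data": records_json,
    },
)

# The response body is a JSON string in the same record-oriented format.
predictions = pd.read_json(response.content, orient="records")
print(predictions.head())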

6. Running the Experiment

Before anything else, you must create a data/ directory in the top-level directory and add the dataset.csv file for everything to work.

It's quite simple to run the project. First you must navigate to the top-level directory and build the Docker image:

docker build -t default-case-study -f ./Dockerfile .

Then, run it with:

docker run default-case-study:latest

This will generate all assets within the container; they can either be copied out or the Jupyter server can be exposed. Otherwise, if you want to run the notebooks without fussing around with Docker, you can take the following steps. First, navigate to the top level of the project and install the package manager Poetry:

python -m pip install poetry

Then, let poetry do the heavy lifting (it may take a little while):

python -m poetry install

And that's it. You may now spin up Jupyter and explore the notebooks for yourself:

poetry run jupyter notebook ./
