jamesdellinger / machine_learning_nanodegree_capstone_project

Solving the Home Credit Default Risk competition on Kaggle while it was still live. Final project for Udacity's Machine Learning Engineer Nanodegree.

Machine Learning Capstone Project

Applying machine learning algorithms and techniques to submit a solution to the Home Credit Default Risk competition on Kaggle while the competition was live.

For Udacity's Machine Learning Engineer Nanodegree.

Topic: Specialization.

Overview

  • Participated in the Home Credit Default Risk Kaggle competition during June of 2018 and earned a public leaderboard score of 0.74111.
  • After finishing this project writeup, I kept refining my algorithm for the next two months and ultimately achieved a final private leaderboard score of 0.79506 when the competition ended on August 29, 2018.
  • This put my solo submission inside the top 8% and was good enough to earn me a bronze medal in the competition.
  • My best performing kernel is here on Kaggle.

Concepts

  • Applied extensive exploration and preprocessing to the competition's dataset.
  • Compared the performance of various learning models, feature sets, and dimensionality reduction techniques.
  • Engineered a handful of new features.
  • Ultimately got the best results by fitting a LightGBM model to the full feature set in order to predict which borrowers are most likely to have difficulty repaying their loans.

My Capstone Project Report

My Capstone Project Proposal

My Competition Solution Code

Project Grading and Evaluation

Home Credit Default Risk Competition Synopsis

The goal of the Home Credit Default Risk competition on Kaggle is to create a machine learning algorithm that can predict the likelihood that a loan applicant will make at least one late payment when repaying their loan. The competition is sponsored by Home Credit, whose mission is to provide a positive and safe borrowing experience to groups of people that traditional, mainstream banks and financial institutions typically refuse to serve.

Home Credit targets a demographic that typically has no recourse but to deal with shady characters such as loan sharks when borrowing money. Many of these unbanked individuals are hard-working, well-intentioned folks who, either due to circumstances beyond their control or past mistakes, have fallen through the financial system’s cracks.

Home Credit needs an algorithm that takes as inputs various personal and alternative financial information from a loan applicant's profile and outputs the probability that the applicant will eventually become delinquent. This probability will be in the range [0.0, 1.0], where 1.0 represents a 100% certainty that the applicant will make at least one late payment and 0.0 indicates that there is zero chance that the applicant will ever be delinquent. The algorithm will be tested on a set of 48,744 individuals who previously borrowed from Home Credit. A CSV file must be produced that contains one header row and 48,744 prediction rows, where each prediction row contains a user ID (the SK_ID_CURR column) and the predicted probability (the TARGET column) that the user will make at least one late payment. The file must be formatted as follows:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
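
As a rough illustration of this layout, here is a minimal sketch that writes such a file using only Python's standard library. The `write_submission` helper and the sample values are hypothetical and are not taken from my solution code; they only mirror the sample rows shown above.

```python
import csv
import io


def write_submission(predictions, out_file):
    """Write (SK_ID_CURR, probability) pairs in the competition's CSV layout."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["SK_ID_CURR", "TARGET"])  # the single header row
    for applicant_id, probability in predictions:
        # Each TARGET value must be a probability in [0.0, 1.0].
        if not 0.0 <= probability <= 1.0:
            raise ValueError(
                f"probability out of range for {applicant_id}: {probability}"
            )
        writer.writerow([applicant_id, probability])


# Illustrative values only, matching the sample rows above.
sample_predictions = [(100001, 0.1), (100005, 0.9), (100013, 0.2)]
buffer = io.StringIO()
write_submission(sample_predictions, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

To write to disk instead of an in-memory buffer, the same helper can be passed a file opened with `open("submission.csv", "w", newline="")`; the file name here is only an example.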

Home Credit knows which borrowers ultimately made at least one late payment, and which ones were never delinquent. A good algorithm will need to predict a high probability of delinquency for the majority of borrowers who did make a late payment. This algorithm will also need to predict a low probability of delinquency for the majority of borrowers who always paid on time.

The scoring metric for submissions is area under the ROC curve. The best performing submissions are ranked on the competition's leaderboard.
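
To make the metric concrete, here is a small sketch of one standard way to compute area under the ROC curve: it equals the fraction of (delinquent, non-delinquent) borrower pairs in which the delinquent borrower received the higher predicted probability, with ties counted as half. The `roc_auc` function and the toy labels and scores are hypothetical and are not Kaggle's scoring code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # delinquent borrower ranked above non-delinquent one
            elif p == n:
                wins += 0.5   # tie between the two scores
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = made at least one late payment, 0 = always paid on time.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.4, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

A score of 1.0 means every delinquent borrower was ranked above every on-time borrower, and 0.5 is what random guessing would achieve. The pairwise loop is quadratic, so for data of the competition's size a library routine such as scikit-learn's `roc_auc_score` is the more practical choice.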

Competition Data Tables

https://www.kaggle.com/c/home-credit-default-risk/data

In order to reproduce my results, the following eight CSV files must be downloaded and unzipped inside the /data directory:

  1. application_test.csv.zip
  2. application_train.csv.zip
  3. bureau.csv.zip
  4. bureau_balance.csv.zip
  5. credit_card_balance.csv.zip
  6. installments_payments.csv.zip
  7. POS_CASH_balance.csv.zip
  8. previous_application.csv.zip
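
The unzip step can be scripted. The sketch below assumes the eight archives listed above have already been downloaded into a local `data` directory; the `extract_archives` helper is hypothetical and is not part of my solution code.

```python
import zipfile
from pathlib import Path

# The eight archives listed above.
EXPECTED_ARCHIVES = [
    "application_test.csv.zip",
    "application_train.csv.zip",
    "bureau.csv.zip",
    "bureau_balance.csv.zip",
    "credit_card_balance.csv.zip",
    "installments_payments.csv.zip",
    "POS_CASH_balance.csv.zip",
    "previous_application.csv.zip",
]


def extract_archives(data_dir):
    """Unzip each expected archive found in data_dir.

    Returns the names of any expected archives that were not found.
    """
    data_dir = Path(data_dir)
    missing = []
    for name in EXPECTED_ARCHIVES:
        archive = data_dir / name
        if not archive.is_file():
            missing.append(name)
            continue
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(data_dir)  # CSVs land next to their archives
    return missing


# Example usage (assumes the archives are in ./data):
# missing = extract_archives("data")
# if missing:
#     print("Still need to download:", ", ".join(missing))
```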

Dependencies
