cinneesol / telco-customer-churn-in-r-and-h2o

Showcase for using H2O and R for churn prediction (inspired by ZhouFang928 examples)

Showcase: telco customer churn prediction with GNU R and H2O

In the blog post Telco Customer Churn with R in SQL Server 2016, ZhouFang928 presented a great analysis of telco customer churn prediction. However, the comparison left out one of my favorite machine-learning libraries, H2O. This showcase demonstrates how easy it is to build high-quality predictive models with the H2O library.

Prerequisites

I have used R version 3.2.3 with the following R packages:

Remark for Windows users

Installing the packages requires a version of Rtools compatible with your R version.

Usage instructions

  1. Install the packages by running source("install_packages.R")
  2. Train and evaluate the model by running source("build_telco_churn_model.R")
  3. After the model has been built successfully, you can find it (in H2O format) in the export folder. It can be loaded into H2O Flow for further inspection.
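The steps above can be sketched as a short R session. This is only an illustration: run_showcase and load_exported_model are hypothetical helper names, and the model file name inside export is not stated in this README, so it is left as an argument. The h2o:: calls assume the h2o package is installed.

```r
# Steps 1 and 2: run both project scripts from the project root.
run_showcase <- function() {
  source("install_packages.R")
  source("build_telco_churn_model.R")
}

# Step 3 (hypothetical helper): reload the exported model into a running
# H2O cluster. `model_file` is whatever file name appears in the export
# folder after a successful run.
load_exported_model <- function(model_file, export_dir = "export") {
  h2o::h2o.init()
  h2o::h2o.loadModel(file.path(export_dir, model_file))
}
```

Nothing is executed until the functions are called, for example run_showcase() followed by load_exported_model() with the exported file name.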

Approach

I decided to go with Gradient Boosting Models (GBM). To select the best model I ran a grid search over the following parameters:

  • number of trees: 50, 100, 500
  • max tree depth: 4, 8, 16, 32

The best model was selected by AUC: 100 trees with a maximum depth of 16. After building the model, I optimized the classification threshold to maximize the minimum per-class accuracy.
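A grid search of this kind could look roughly like the sketch below. It is not the code from find_best_model.R; the function name find_best_gbm, the grid id, and the arguments train, predictors and response are assumptions made for illustration. The hyperparameter values and the 5-fold cross-validation come from this README. Calls are written with the h2o:: prefix, so the h2o package must be installed and a cluster started with h2o::h2o.init() before the function is called.

```r
# Hyperparameter values taken from the list above.
hyper_params <- list(
  ntrees    = c(50, 100, 500),
  max_depth = c(4, 8, 16, 32)
)

# Hypothetical helper: `train` is an H2OFrame, `predictors` a character
# vector of column names, `response` the name of the churn column.
find_best_gbm <- function(train, predictors, response,
                          grid_id = "gbm_churn_grid") {
  h2o::h2o.grid(
    algorithm      = "gbm",
    grid_id        = grid_id,
    x              = predictors,
    y              = response,
    training_frame = train,
    nfolds         = 5,            # 5-fold cross-validation
    hyper_params   = hyper_params
  )
  # Sort the grid by AUC, best model first, and fetch that model.
  sorted <- h2o::h2o.getGrid(grid_id, sort_by = "auc", decreasing = TRUE)
  h2o::h2o.getModel(sorted@model_ids[[1]])
}
```

Sorting the grid by AUC and taking the first model mirrors the selection criterion described above.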

Obtained results

The best model (with the threshold selected to maximize the minimum per-class accuracy) gave the following results on the test dataset:

  • AUC = 0.947
  • Accuracy = 0.866
  • Precision = 0.395
  • Recall = 0.875
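To make the threshold criterion and the reported metrics concrete, here is a minimal base R sketch. It is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, labels are assumed to be coded 0/1 with 1 meaning churn, and the toy data are made up.

```r
# Minimum of the two per-class accuracies at a given threshold.
min_per_class_accuracy <- function(actual, score, threshold) {
  predicted <- as.integer(score >= threshold)
  acc_pos <- mean(predicted[actual == 1] == 1)  # accuracy on churners
  acc_neg <- mean(predicted[actual == 0] == 0)  # accuracy on non-churners
  min(acc_pos, acc_neg)
}

# Scan candidate thresholds and keep the first one with the highest value.
best_threshold <- function(actual, score,
                           candidates = seq(0.01, 0.99, by = 0.01)) {
  values <- vapply(candidates,
                   function(t) min_per_class_accuracy(actual, score, t),
                   numeric(1))
  candidates[which.max(values)]
}

# Accuracy, precision and recall at a given threshold.
classification_metrics <- function(actual, score, threshold) {
  predicted <- as.integer(score >= threshold)
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  tn <- sum(predicted == 0 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(accuracy  = (tp + tn) / length(actual),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

# Toy data, invented for illustration only (2 churners, 6 non-churners).
actual <- c(0, 0, 0, 0, 0, 0, 1, 1)
score  <- c(0.055, 0.125, 0.235, 0.315, 0.425, 0.715, 0.635, 0.925)

thr <- best_threshold(actual, score)
metrics <- classification_metrics(actual, score, thr)
print(thr)
print(metrics)
```

On imbalanced data such a threshold tends to favor recall: even a few false positives among the many non-churners can outnumber the true positives, which lowers precision.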

Performance issues

The computation involved validating 6 GBM models with different parameters using 5-fold cross-validation. On my laptop (Intel i7, 8 GB RAM, Windows 10) it took around 25 minutes. On an Amazon EC2 c4.4xlarge instance the time dropped to around 14-15 minutes.

Good practices

  1. Always install packages separately for each project.
  2. Select the best model using a parameter tuning procedure.
  3. Do not forget to optimize the threshold.
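The first practice, a per-project package library, can be sketched in base R as follows. This is an assumption about how such a setup may look, not the contents of install_packages.R; use_local_library is a hypothetical helper name, and the commented install line needs network access.

```r
# Hypothetical helper: create a project-local library folder and put it
# first on the library search path, so packages are installed into and
# loaded from the project rather than the system-wide library.
use_local_library <- function(lib_dir = "libs") {
  dir.create(lib_dir, showWarnings = FALSE, recursive = TRUE)
  .libPaths(c(normalizePath(lib_dir), .libPaths()))
  invisible(.libPaths()[1])
}

lib <- use_local_library("libs")

# Packages can then be installed into the local folder, for example:
# install.packages("h2o", lib = lib)
```

Keeping the library inside the project folder means that different projects can pin different package versions without interfering with each other.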

Project structure description

Folders:

  • data - contains the CSV file with customer information. It is a copy of the data from ZhouFang928's example.
  • libs - contains the packages installed by install_packages.R
  • export - stores the computation results (currently the final model is saved there)

Files:

  • install_packages.R - R script that installs the packages into the local libs folder
  • build_telco_churn_model.R - R script that trains and evaluates the model
  • find_best_model.R - utility function that runs the grid search and returns the best model together with the optimal threshold

About

License: Apache License 2.0

Languages

Language: R 100.0%