jzallen07 / Churn_Classification_App

Scalable churn classifier using pyspark

Churn Classification App:

Project Overview

Predicting churn rates is a common and often difficult problem that data scientists face across many customer-facing industries. Businesses need to identify the customer segments, or even individual customers, that are likely to churn, both to drive efforts to win those customers back and to inform future marketing programs aimed at finding quality customers.

To this end, this project examines a large customer data set using Spark-backed machine learning to model and predict customer churn. The resulting model has been placed into a web application, and a full write-up of the findings can be found here.

Problem Statement

The central goal is to predict whether or not a customer is about to churn, and for that prediction to inform efforts to retain that user as a customer. Specifically, the prediction will be a binary output of user status. This will be accomplished by examining the data at hand, engineering features, building a classification model, evaluating model performance, and finally integrating the model into a web application.
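
As a concrete starting point, here is a minimal sketch of how such a binary churn label could be derived, assuming (as the Data section below describes) that a user's auth field reads “Cancelled” once they have churned; the file path and variable names are illustrative, not the project's actual code:

    # Sketch only: derive a binary per-user churn label from the event log.
    # Assumes auth == "Cancelled" marks a churned user; the path is illustrative.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("sparkify-churn").getOrCreate()
    events = spark.read.json("mini_sparkify_event_data.json")

    # Label a user 1 (churned) if any of their events has auth == "Cancelled".
    labels = events.groupBy("userId").agg(
        F.max(F.when(F.col("auth") == "Cancelled", 1).otherwise(0)).alias("churn")
    )
    labels.groupBy("churn").count().show()  # exposes the class imbalance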

Metrics

The initial analysis shows that the dataset is imbalanced (see the Data section): users who churned are outnumbered more than three to one by users who did not. Accuracy (the number of correct predictions divided by the total number of predictions) is therefore a poor metric for evaluating the resulting model. Both types of error matter here: a false negative means missing, and likely losing, a customer who is about to churn, while a false positive means spending unnecessarily to retain a customer who was not going to leave. For that reason the model is evaluated with the F1 score, which weighs precision and recall equally.

Good practice will be followed: all modeling will be validated against a withheld test set. All initial models will be evaluated on accuracy, precision, recall, and F1 score across 5-fold cross validation. As this is a relatively imbalanced data set, accuracy will be of limited use; F1 will be the primary metric for overall model performance, though precision and recall will also be of import.
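
A minimal sketch of that evaluation setup in SparkML follows; pipeline, train, and test are placeholders for the project's actual pipeline and data splits, not names from its code:

    # Sketch only: 5-fold cross validation with F1 as the tuning metric.
    # `pipeline` is assumed to be a pyspark.ml Pipeline ending in a classifier;
    # `train` and `test` are assumed DataFrames with "features" and "label" columns.
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")

    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=ParamGridBuilder().build(),  # single empty grid: plain CV
        evaluator=evaluator,
        numFolds=5,
    )
    cv_model = cv.fit(train)

    # Report the full metric set on the withheld test data.
    predictions = cv_model.transform(test)
    for metric in ("accuracy", "weightedPrecision", "weightedRecall", "f1"):
        evaluator.setMetricName(metric)
        print(metric, evaluator.evaluate(predictions))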

Data:

This project is based on the medium-sized version of the Sparkify data set; the available versions vary principally in the number of user entries they contain. The largest, weighing in at 12 GB, can be found on AWS.

The included features:

    #   Column         Type     Description
    1   userId         string   Unique user identifier
    2   artist         string   Name of the artist
    3   auth           string   “Logged-in” or “Cancelled”
    4   firstName      string   First name of the user
    5   gender         string   User gender, “F” or “M”
    6   itemInSession  bigint   Index of the item within the session
    7   lastName       string   Last name of the user
    8   length         double   Length of the song related to the event
    9   level          string   Subscription level, “free” or “paid”. A user can change level, so events for the same user can carry different levels
    10  location       string   Location of the user at the time of the event
    11  method         string   HTTP method, “GET” or “PUT”
    12  page           string   Type of action: “NextSong”, “Login”, “Thumbs Up”, etc.
    13  registration   bigint   Timestamp of the user’s registration
    14  sessionId      bigint   Unique session identifier
    15  song           string   Name of the song related to the event
    16  status         bigint   HTTP response status: 200, 404, 307
    17  ts             bigint   Event timestamp
    18  userAgent      string   System through which the user was interacting with the platform
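
To give a flavor of how these event-level columns become per-user features (a point revisited in the conclusions), a hypothetical aggregation pass might look like the following; the chosen aggregates are illustrative, not the project's exact feature set:

    # Sketch only: roll the event log up into one row of features per user.
    # `events` is the DataFrame loaded earlier; aggregate choices are illustrative.
    import pyspark.sql.functions as F

    features = events.groupBy("userId").agg(
        F.countDistinct("sessionId").alias("n_sessions"),
        F.sum(F.when(F.col("page") == "NextSong", 1).otherwise(0)).alias("n_songs"),
        F.sum(F.when(F.col("page") == "Thumbs Up", 1).otherwise(0)).alias("n_thumbs_up"),
        F.sum("length").alias("total_listen_time"),
        F.max(F.when(F.col("level") == "paid", 1).otherwise(0)).alias("ever_paid"),
    )
    features.show(5)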

Web Application:

The web application uses a Flask back end with Bootstrap controls on the front end. It consists of the following parts:

  • Python script build_model.py, which builds the machine learning model. The script takes two parameters: the path to the dataset and the path where the resulting model should be saved (a sketch of its overall shape follows this list).
  • A saved, pretrained machine learning model generated by build_model.py. The application loads this model and uses it to make predictions.
  • Flask application script run.py, which starts the application and renders the web pages (sketched after the start-up steps below). The script loads the model on start-up and applies it to the data the user provides on the web page.
  • Web page templates master.html and go.html. The pages use Bootstrap controls and allow the user to enter information about a customer; the application then reports whether that customer is about to churn.
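
A hypothetical skeleton of build_model.py, consistent with the description above; engineer_features is a made-up placeholder for the project's actual feature code, and the Random Forest final stage reflects the model named in the conclusions:

    # Sketch only: the overall shape build_model.py could take.
    import sys
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    def main(data_path, model_path):
        spark = SparkSession.builder.appName("build-churn-model").getOrCreate()
        events = spark.read.json(data_path)

        # Feature engineering as sketched under Data; `engineer_features`
        # is a hypothetical helper producing per-user rows with a "churn" label.
        users = engineer_features(events)

        assembler = VectorAssembler(
            inputCols=[c for c in users.columns if c not in ("userId", "churn")],
            outputCol="features",
        )
        rf = RandomForestClassifier(labelCol="churn", featuresCol="features")
        model = Pipeline(stages=[assembler, rf]).fit(users)
        model.save(model_path)  # reload later with PipelineModel.load

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])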

Web App Start-up:

Setup Instructions:

To run the web application, follow these steps:

  1. Ensure the required libraries are installed (see Requirements below).
  2. Run build_model.py:
     $ python build_model.py ../mini_sparkify_event_data.json ../classifier
  3. From the app folder, run run.py:
     $ cd ../app
     $ python run.py
  4. Open http://0.0.0.0:3001/ in a web browser.
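
For orientation, the model-loading side of run.py might look roughly like this; it is a sketch under the same assumptions as the snippets above (predict_churn is a hypothetical helper), not the actual script:

    # Sketch only: load the saved model and score one customer profile.
    from pyspark.sql import SparkSession
    from pyspark.ml import PipelineModel

    spark = SparkSession.builder.appName("churn-app").getOrCreate()
    model = PipelineModel.load("../classifier")  # path from step 2 above

    def predict_churn(profile: dict) -> float:
        """`profile` maps feature names to values entered on the web page."""
        row = spark.createDataFrame([profile])
        return model.transform(row).head()["prediction"]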

Conclusions & Results Summary:

Did we achieve the goal of increasing customer retention? Maybe, maybe not. A tool such as this one could certainly be of use across many industries, but by itself it will not come close to solving the problem. Models like this one are only part of the equation. They are powerful and informative tools, but at present that is all they are; they will not proactively solve complex problems such as this one on their own just yet. If a model shows that a customer is likely to churn, the remainder of the work lies with people, and/or other tooling, to reach out to that customer.

  • As with most data projects, the majority of the work here was data wrangling and feature engineering: how best to take an amalgamation of customer actions and metadata and turn it into a profile for predicting the future behavior of other people (the aggregation sketch under Data gives a flavor of this).

  • Several classification methods were tried and evaluated using a variety of metrics, principally F1, and a single underlying model (Random Forest) was chosen to power the Flask app.

  • The final part of the project was building a web-based user interface to provide inputs to, and receive output from, the model. It is currently limited to single-entry predictions and does not scale to large data sets, but that is a planned feature addition.

Future Improvements:

  • Introduce a method for submission of large volumes of customer data into the model.
  • Train a model on the full 12 GB data set.
  • Enhance the modeling either through stacked models or neural nets.

Requirements:

  • Python 3.6.5
  • Pyspark & SparkML
  • Flask
