zhiruiwang / Sparkify_Churn_Prediction

Use PySpark to predict churn for a music streaming website

Sparkify_Churn_Prediction

Table of Contents

  1. Installation
  2. Project Motivation
  3. Files Description
  4. Result

Installation

This project uses the following Python libraries:

PySpark

itertools

re

h2o

Matplotlib

You will also need software that can run a Jupyter Notebook.

If you do not have Python installed yet, the Anaconda distribution is highly recommended. It ships with Jupyter, Matplotlib, and many other packages, while PySpark and h2o may need to be installed separately (itertools and re are part of the Python standard library). For Spark, I am using Databricks, but you can also use AWS or IBM Cloud.

Project Motivation

This is a Udacity capstone project that uses Spark to analyze user behavior data from the music app Sparkify.

Sparkify is a music app, and this dataset contains two months of Sparkify user behavior logs. Each log entry contains some basic information about the user as well as information about a single action, so one user can have many entries. Some of the users in the data have churned, and they can be identified by their account cancellation events.
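To make the "one user, many entries" structure concrete, here is a minimal plain-Python sketch of how a per-user churn label could be derived from such a log. It is not the notebook's PySpark code. The field names `userId` and `page`, the helper `label_churn`, and the sample events are assumptions made up for illustration; only the page name "Cancellation Confirmation" comes from this project's churn definition.

```python
from collections import defaultdict

# Page name taken from this project's churn definition.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {user_id: 1 if the user ever reached the churn page, else 0}.

    `events` is an iterable of dicts. The keys "userId" and "page" are
    assumed field names, not confirmed by this README.
    """
    labels = defaultdict(int)
    for event in events:
        user = event["userId"]
        is_churn_event = int(event["page"] == CHURN_PAGE)
        # A single cancellation event marks the whole user as churned.
        labels[user] = max(labels[user], is_churn_event)
    return dict(labels)


# Hypothetical sample log: two users, several entries each.
sample_events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Home"},
]

churn_labels = label_churn(sample_events)
print(churn_labels)
```

In Spark the same idea would typically be a per-user aggregation over the event log; the sketch above only illustrates the labeling rule.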

Files Description

Sprakify .html: Databricks notebook and main file of the project. It demonstrates how to use PySpark to explore the data and build the model.

Sprakify .ipynb: Jupyter notebook exported from Databricks. If you want to run the project locally or on AWS or IBM Cloud, this file is the better starting point. Because the notebook was designed for Databricks, it renders very messily when viewed on GitHub.

medium-sparkify-event-data.json: Input data for the workflow, which can be found at the following link

Result

This project defines customer churn as whether a user visited the Cancellation Confirmation page. We used a sample dataset on a Databricks PySpark cluster, and the same workflow can easily be scaled to much larger datasets (big data). We trained an XGBoost model on the engineered features, which gave an AUC of 0.84 on the test data. The variable importance plot is in line with the exploratory analysis we did before implementing the model.

I wrote a blog post covering the details; you can find it here.
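For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen churned user receives a higher predicted score than a randomly chosen non-churned user. The sketch below computes it directly from that definition in plain Python. The function `roc_auc` and the toy labels and scores are hypothetical, and this is not how the notebook evaluates the model.

```python
def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    y_true holds 0/1 labels (1 = churned); y_score holds predicted scores.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data for illustration only, not results from this project.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 4))
```

The pairwise loop is quadratic in the number of users, so it is only meant to show the definition; library implementations use a sort-based computation instead.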
