churn-user-prediction data-exploration feature-engineering gradient-boosted-trees linear-regression machine-learning random-forest sparkmllib

Sparkify: User churn prediction

Installation

Apart from the Anaconda distribution of Python, this code requires PySpark, running either in standalone mode or on a cluster.

Project Motivation

Predicting churn rates is a common and challenging problem that data scientists and analysts encounter in any customer-facing business. In this notebook, the Sparkify mini dataset is used to explore the data and then build a model with the Spark ML libraries to predict user churn.

Folder Structure

Sparkify.ipynb
- Contains all code for data cleaning, data exploration, modelling, and conclusions.

Feature Engineering

The following features were used for the model:

Average Session length
Number of Platforms used by the user
Number of artists
Number of Thumbs Up
Number of Thumbs Down
Number of Sessions
Number of days since registration
Gender
Platform
Level of subscription
Churn (label)
Downgraded
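
As a rough illustration of how a few of these per-user features could be derived from an event log, here is a minimal sketch in plain Python. The field names (`userId`, `sessionId`, `page`, `artist`, `ts`), the sample records, the helper `build_user_features`, and the rule that a "Cancellation Confirmation" page marks churn are all assumptions made for this example and are not taken from the notebook, where the equivalent work would be done with Spark DataFrame aggregations.

```python
from collections import defaultdict

# Hypothetical event records; field names and values are assumptions for illustration.
events = [
    {"userId": "u1", "sessionId": 1, "page": "NextSong", "artist": "A", "ts": 0},
    {"userId": "u1", "sessionId": 1, "page": "Thumbs Up", "artist": None, "ts": 60_000},
    {"userId": "u1", "sessionId": 2, "page": "NextSong", "artist": "B", "ts": 200_000},
    {"userId": "u1", "sessionId": 2, "page": "Thumbs Down", "artist": None, "ts": 320_000},
    {"userId": "u2", "sessionId": 3, "page": "NextSong", "artist": "A", "ts": 0},
    {"userId": "u2", "sessionId": 3, "page": "Cancellation Confirmation", "artist": None, "ts": 30_000},
]


def build_user_features(events):
    """Aggregate raw events into one feature dictionary per user."""
    sessions = defaultdict(lambda: defaultdict(list))  # user -> session -> timestamps
    artists = defaultdict(set)
    thumbs_up = defaultdict(int)
    thumbs_down = defaultdict(int)
    churned = defaultdict(bool)

    for e in events:
        user = e["userId"]
        sessions[user][e["sessionId"]].append(e["ts"])
        if e["artist"] is not None:
            artists[user].add(e["artist"])
        if e["page"] == "Thumbs Up":
            thumbs_up[user] += 1
        elif e["page"] == "Thumbs Down":
            thumbs_down[user] += 1
        elif e["page"] == "Cancellation Confirmation":
            # Assumed churn definition for this sketch only.
            churned[user] = True

    features = {}
    for user, by_session in sessions.items():
        # Session length = last timestamp minus first timestamp, in milliseconds.
        lengths = [max(ts) - min(ts) for ts in by_session.values()]
        features[user] = {
            "avg_session_length_ms": sum(lengths) / len(lengths),
            "num_sessions": len(by_session),
            "num_artists": len(artists[user]),
            "num_thumbs_up": thumbs_up[user],
            "num_thumbs_down": thumbs_down[user],
            "churn": int(churned[user]),
        }
    return features


features = build_user_features(events)
```

Each user ends up with a single row of numeric features plus the label, which is the shape a classifier expects as input.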

Modelling

The following models were tried on the features created from the dataset after cleaning and exploration:

Logistic Regression
Gradient-Boosted Trees
Random Forest Classifier

Of these models, gradient-boosted trees (GBT) performed best with an F1 score of 86%, followed by the random forest classifier (RFC) at 83% and logistic regression (LR) at 79%.
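
For readers unfamiliar with the metric, the sketch below shows how a binary F1 score is computed from precision and recall. The function `f1_score` and the sample labels are hypothetical and written only for this example. The notebook may average the metric differently; Spark's `MulticlassClassificationEvaluator`, for instance, reports an F1 weighted across labels.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = f1_score(y_true, y_pred)
```

F1 is a common choice for churn problems because churned users are usually a small minority, and plain accuracy can look high even when a model misses most of them.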

Results

The main findings are summarized in the accompanying blog post.

Licensing, Authors, and Acknowledgements

Udacity

About

MIT License

prodo56 / Sparkify-predicting-churn