davidweisscode / churn-prediction-sparkify

Machine learning project modeling user churn of a hypothetical music streaming service

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

churn-prediction-sparkify

Machine learning project modeling user churn of a hypothetical music streaming service

Data Science Nanodegree - Capstone Project: PySpark Customer Churn Prediction for the Sparkify Music Streaming Service

Head over to Medium to read my blogpost at

https://davidweisspost.medium.com/churn-prediction-with-pyspark-52ddece92ba4

The Capstone project for Udacity's Data Scientist Nanodegree. This project involves predicting Customer Churn for a hypothetical music streaming app Sparkify, using Spark's MLlib to engineer features and build a classification model. The dataset used here is a medium-sized (248 MB, with 544,000 rows) version of the whole dataset (which is 12 GB).
This project is worked on IBM Cloud's Watson Studio, uploading the data cluster, with a Python 3.7/Spark 3.0 enabled Jupyter Notebook.

Using pyspark, the project broadly involves the following:

  • Loading and cleaning the data
  • Exploratory Data Analysis
  • Feature Engineering - appropriate features are selected based on the EDA
  • Modelling - two different classification models are tested and evaluated
  • Model Tuning - Hyperparameter tuning using grid search
  • Concluding Remarks

Installation

  • Python 3.6+
  • pyspark.*
  • Jupyter - available through this link, or IBM Watson Studio (Lite)

About

Machine learning project modeling user churn of a hypothetical music streaming service


Languages

Language:Jupyter Notebook 100.0%