AlaGrine / Udacity_Sparkify_capstoneProject

Churn prediction and machine learning at scale with PySpark.

Sparkify: churn prediction with Spark

DSND Capstone Project

Table of Contents

  1. Project Motivation
  2. Installation
  3. File Descriptions
  4. Instructions
  5. Results
  6. Acknowledgements

Project Motivation

This project is part of Udacity's Data Science Nanodegree Program.

The aim of this project is to build a binary classification model with PySpark ML to predict customer churn for Sparkify.

Udacity provided a 12GB dataset of customer activity from Sparkify, a fictional music streaming service similar to Spotify. The dataset logs user interactions with the service, such as playing songs, adding songs to playlists, and giving songs a thumbs up or down.

Small (125MB) and medium (237MB) subsets of the full dataset are also provided.

PySpark, the Python API for Apache Spark, is used both on a local machine and on an AWS EMR cluster.
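
Loading the event log is the first step of the analysis. Below is a minimal sketch, assuming the small subset is stored locally as mini_sparkify_event_data.json (adjust the file name to your copy):

    from pyspark.sql import SparkSession

    # Create (or reuse) a local Spark session
    spark = (SparkSession.builder
             .appName("Sparkify churn prediction")
             .getOrCreate())

    # The Udacity subsets are newline-delimited JSON event logs
    events = spark.read.json("mini_sparkify_event_data.json")
    events.printSchema()
    print(events.count())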

The project is divided into the following sections:

  1. Use the small subset (on a local machine) to perform exploratory data analysis and build a prototype machine learning model (a sketch of this step follows the list).
  2. Scale up: use the medium dataset (on a local machine) to see if our model works well on a larger dataset.
  3. Deploy a cluster in the cloud with AWS and train on the full 12GB dataset.
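
The prototype in step 1 follows the standard PySpark ML workflow: derive a churn label per user, aggregate a few behavioural features, and fit a classifier. The sketch below is illustrative only; it reuses the events DataFrame loaded above, the column names (userId, page) match the Sparkify log, but the features and the choice of logistic regression are simplifications of what the notebooks actually do:

    from pyspark.sql import functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml.classification import LogisticRegression

    # Label: a user who ever reaches the "Cancellation Confirmation" page has churned
    users = (events
             .withColumn("is_cancel",
                         (F.col("page") == "Cancellation Confirmation").cast("int"))
             .groupBy("userId")
             .agg(F.max("is_cancel").alias("label"),
                  F.count("*").alias("n_events"),
                  F.sum((F.col("page") == "Thumbs Up").cast("int")).alias("n_thumbs_up"),
                  F.sum((F.col("page") == "Thumbs Down").cast("int")).alias("n_thumbs_down")))

    # Assemble and scale the features, then fit a simple classifier
    assembler = VectorAssembler(inputCols=["n_events", "n_thumbs_up", "n_thumbs_down"],
                                outputCol="features_raw")
    scaler = StandardScaler(inputCol="features_raw", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    train, test = users.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
    predictions = model.transform(test)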

Installation

This project requires Python 3, Spark 3.4.1, and the following Python libraries installed:

PySpark, Pandas, NumPy, SciPy, Plotly, and Matplotlib

File Descriptions

The main file of the project is Sparkify.ipynb, which uses the small dataset and can therefore be run locally.

The project folder also contains the following:

  • Sparkify_medium.ipynb: The notebook that uses the medium dataset; it can also be run locally.

  • metrics folder: Model metrics, including F1-score and training time, saved as CSV files (a sketch of how these metrics can be computed follows this list).

  • statistics folder: Descriptive statistics summarizing the main characteristics of the small and medium datasets, saved as CSV files.

  • AWS_EMR_bigData folder: Contains the inputs to upload to your S3 bucket and the outputs downloaded from the same bucket.

    • My_script.py: The Python script to run on the EMR cluster. You will need to upload it to your S3 bucket first.

    • install-my-jupyter-libraries: A shell script to upload to your S3 bucket before creating your EMR cluster. When you create the cluster, add this script as a bootstrap action to install the required libraries.

    • S3_download folder: Contains the metrics of the model run on the EMR cluster, downloaded from my S3 bucket.
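
As referenced above, the metrics stored in the metrics folder can be produced with PySpark's built-in evaluator. A minimal sketch, reusing the assembler, scaler, lr, train, and test objects from the earlier sketch and assuming the output path metrics/logistic_regression:

    import time
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    # Time the training run
    start = time.time()
    model = Pipeline(stages=[assembler, scaler, lr]).fit(train)
    training_time = time.time() - start

    # F1-score on the held-out test set
    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="f1")
    f1 = evaluator.evaluate(model.transform(test))

    # Persist the metrics as a single CSV file
    (spark.createDataFrame([("LogisticRegression", float(f1), float(training_time))],
                           ["model", "f1", "training_time_s"])
     .coalesce(1)
     .write.mode("overwrite")
     .csv("metrics/logistic_regression", header=True))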

Instructions

  1. To run the code on your local machine, all you need to do is unzip the JSON files provided by Udacity.

  2. To run the Python script on the AWS EMR cluster, submit it to your cluster from the command line as follows:

    aws s3 cp s3://your_bucket_name/My_script.py .

    spark-submit My_script.py

    The first command copies the script from your S3 bucket to the master node; spark-submit then runs it on the cluster.
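
Inside My_script.py, the full dataset can be read directly from S3 through EMRFS. A minimal sketch, with hypothetical bucket and key names you would replace with your own:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("Sparkify churn prediction on EMR")
             .getOrCreate())

    # s3:// paths are readable directly on an EMR cluster
    events = spark.read.json("s3://your_bucket_name/sparkify_event_data.json")

    # ... feature engineering, training and evaluation as in the notebooks ...

    # Write the metrics back to the bucket so they can be downloaded later, e.g.:
    # metrics_df.coalesce(1).write.mode("overwrite").csv(
    #     "s3://your_bucket_name/S3_download/metrics", header=True)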

Results

I wrote a blog post about this project. You can find it here.

Acknowledgements

Credit to Udacity for making this a wonderful learning experience.
