This repository contains a quick introduction to Apache Spark MLlib through two practical notebooks. The workshop is designed to provide hands-on experience with Spark's machine learning capabilities, focusing on Market Basket Analysis and Customer Churn Prediction.
- Overview
- Notebooks
- Resources
- Installation
- Docker and Docker-Compose Requirements
- Customization
- Managing the Spark Cluster
## Overview

This workshop provides a quick introduction to Spark MLlib. It includes two main notebooks:
- Market Basket Analysis using Apache Spark: Focuses on the FP-Growth algorithm and Exploratory Data Analysis (EDA).
- Customer Churn Prediction with PySpark MLlib: Focuses on Random Forest classification and the machine learning pipeline.
## Notebooks

### Market Basket Analysis using Apache Spark

The first notebook demonstrates how to perform Market Basket Analysis using Apache Spark. It includes:
- Loading and exploring the Instacart dataset.
- Performing Exploratory Data Analysis (EDA).
- Implementing the FP-Growth algorithm to find frequent itemsets and association rules.
### Customer Churn Prediction with PySpark MLlib

The second notebook walks through the process of predicting customer churn using PySpark MLlib. It includes:
- Loading and preparing the Telco Churn dataset.
- Building a machine learning pipeline with feature engineering.
- Training and evaluating a Random Forest classifier.
## Resources

| Category | Content | Publisher |
|---|---|---|
| Book | Scaling Machine Learning with Spark by Adi Polak | O'Reilly |
| Spark Documentation | Machine Learning Library (MLlib) Guide | Spark Open Source |
| Coursera (Beginner) | Machine Learning Specialization | DeepLearning.AI |
| Coursera (Beginner) | IBM Data Science Professional Certificate | IBM |
| Coursera (Intermediate) | Machine Learning Specialization | University of Washington |
## Installation

To install the necessary dependencies, use Poetry:

```shell
poetry install
```
## Docker and Docker-Compose Requirements

Ensure you have Docker and Docker-Compose installed on your system; installation instructions are available in the official Docker documentation.

If you are using macOS, you can use Colima to run Docker. Colima is a container runtime that runs on macOS with minimal setup. First, install Colima by following its official installation instructions.

To start Colima with 8 CPUs, 12 GB of RAM, and 30 GB of disk space, run:

```shell
colima start --cpu 8 --memory 12 --disk 30
```
## Customization

The default Spark cluster settings are optimized to execute the FP-Growth algorithm. You can customize the driver and worker memory and the number of cores to match your machine's specifications. These settings are configured in the docker-compose.yml file:

```yaml
services:
  spark:
    environment:
      - SPARK_DRIVER_MEMORY=3G
  spark-worker:
    environment:
      - SPARK_WORKER_MEMORY=3G
      - SPARK_WORKER_CORES=2
```
To change the number of Spark worker nodes, modify the --scale option in the makefile:

```makefile
.PHONY: spark-up
spark-up:
	docker-compose up -d --scale spark-worker=3
```
## Managing the Spark Cluster

You can use the makefile to easily create and destroy the Spark cluster:

```shell
make spark-up
make spark-down
```