This repository contains a quick introduction to Apache Spark MLlib through two practical notebooks. The workshop is designed to provide hands-on experience with Spark's machine learning capabilities, focusing on Market Basket Analysis and Customer Churn Prediction.
- Overview
- Notebooks
- Resources
- Installation
- Docker and Docker-Compose Requirements
- Customization
- Managing the Spark Cluster
## Overview

This workshop provides a quick introduction to Spark MLlib. It includes two main notebooks:
- Market Basket Analysis using Apache Spark: Focuses on the FP-Growth algorithm and Exploratory Data Analysis (EDA).
- Customer Churn Prediction with PySpark MLlib: Focuses on Random Forest classification and the machine learning pipeline.
## Notebooks

### Market Basket Analysis using Apache Spark

The first notebook demonstrates how to perform Market Basket Analysis using Apache Spark. It includes:
- Loading and exploring the Instacart dataset.
- Performing Exploratory Data Analysis (EDA).
- Implementing the FP-Growth algorithm to find frequent itemsets and association rules.
### Customer Churn Prediction with PySpark MLlib

The second notebook walks through the process of predicting customer churn using PySpark MLlib. It includes:
- Loading and preparing the Telco Churn dataset.
- Building a machine learning pipeline with feature engineering.
- Training and evaluating a Random Forest classifier.
## Resources

| Category | Content | Publisher |
|---|---|---|
| Book | Scaling Machine Learning with Spark by Adi Polak | O'Reilly |
| Spark Documentation | Machine Learning Library (MLlib) Guide | Spark Open Source |
| Coursera (Beginner) | Machine Learning Specialization | DeepLearning.AI |
| Coursera (Beginner) | IBM Data Science Professional Certificate | IBM |
| Coursera (Intermediate) | Machine Learning Specialization | University of Washington |
## Installation

To install the necessary dependencies, use Poetry:

```shell
poetry install
```
## Docker and Docker-Compose Requirements

Ensure you have Docker and Docker-Compose installed on your system; installation instructions are available in the official Docker documentation.

If you are using macOS, you can use Colima to run Docker. Colima is a container runtime that runs on macOS with minimal setup. First, install Colima by following its official installation instructions.

To start Colima with 8 CPUs, 12 GB of RAM, and 30 GB of disk space, run:

```shell
colima start --cpu 8 --memory 12 --disk 30
```
## Customization

The default Spark cluster settings are optimized to execute the FP-Growth algorithm. You can customize the driver and worker memory and the number of cores to match your machine's specifications. These settings are configured in the docker-compose.yml file:

```yaml
services:
  spark:
    environment:
      - SPARK_DRIVER_MEMORY=3G
  spark-worker:
    environment:
      - SPARK_WORKER_MEMORY=3G
      - SPARK_WORKER_CORES=2
```
To change the number of Spark worker nodes, modify the --scale option in the makefile:

```makefile
.PHONY: spark-up
spark-up:
	docker-compose up -d --scale spark-worker=3
```
## Managing the Spark Cluster

You can use the makefile to easily create and destroy the Spark cluster:

```shell
make spark-up
make spark-down
```