docker-spark

Overview

This is a learner project to understand how to ingest a semi-structured data and query with Spark.

Run the init-spark script to deploy a docker container of Hadoop cluster server and Spark
Build the Java based Spark application to a Jar file
Docker cp the Jar file and CSV dataset to Spark container
Run Jar to process CSV dataset
Read results

Tech stack

Docker
Spark
Java
Gradle

data

This folder consists of a CSV dataset that describes the total attendance group by medical institutions and year.

spark

This folder consists of a Spark application that will process the CSV dataset to return the total attendance group by medical institutions.

init-spark shell script

This is a script that will git clone the Spark docker GitHub project, deploy a docker container of Spark.

Prerequsites

Download and install Docker. Follow the below guides.

https://docs.docker.com/install

How to run

Start your docker daemon

This is really depend on your OS. For my case, it is just starting the Docker app.

Deploy Spark container

This will deploy the docker container holding Spark.

./init-spark.sh

Build the Spark application

Use your favorite IDE and build the jar in the spark folder.

# go to the output jar folder
zip -d spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF

Copy the Jar and dataset into the Hadoop + Spark container

# Go to data folder
docker cp hospital-and-outpatient-attendances.csv \
<spark_server_container_id>:hospital-and-outpatient-attendances.csv

# Go to spark folder
docker cp spark.jar <spark_server_container_id>:spark.jar

Process the dataset and enjoy the output results

# Get into the Spark container
docker exec -it <spark_server_container_id> bash

# Process the dataset
java -cp spark.jar SparkApplication hospital-and-outpatient-attendances.csv

Housekeeping

Here are some housekeeping tips if you are on a low memory resource machine like me.

# This is to have a clean state of your docker environment
docker stop $(docker ps -a -q) && \
docker system prune -a

TODO

Create and integrate a REST API
Extract the output result to the REST API

panggd / docker-spark