willgarcia / spark-aws-emr

Making this repo public for the Udacity review - Data engineering nanodegree project 4

Spark on AWS EMR - Data lake project

This project exercises the use of:

  • Spark SQL queries to parallelise data import from AWS S3
  • AWS EMR using Spark in YARN mode
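
For example, a Spark SQL data import from S3 is distributed across the YARN executors without any extra code. A minimal sketch (the bucket name is a placeholder, matching the dl.cfg settings described below):

from pyspark.sql import SparkSession

# When submitted with `spark-submit --master yarn`, this session is
# scheduled by YARN across the EMR cluster nodes.
spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Reading a whole S3 prefix splits the input into many partitions,
# so the import is parallelised across the executors.
songs = spark.read.json("s3a://[bucket-name]/song_data/*/*/*/*.json")
print(songs.rdd.getNumPartitions())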

Pre-requisites

  1. AWS CLI

This enables us to authenticate with AWS from the command line. The installation guide can be found here.

  2. EC2 Key pair

This key will be used to SSH into the Spark master to submit Spark scripts.

In the EC2 console, create a new key pair (PEM format).

/!\ Do not lose or share this key.

  3. Provision the Spark cluster
aws emr create-cluster \
    --name sparkcluster \
    --use-default-roles \
    --release-label emr-5.28.0 \
    --applications Name=Spark \
    --ec2-attributes KeyName=[keypair-name] \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --auto-terminate
  4. Parquet-tools

This utility can be used to visualise the Parquet files saved by our ETL process.

On macOS:

brew install parquet-tools

Usage:

parquet-tools cat part-00015-699b443b-1a3f-4144-804d-af0c537fcc4c.c000.snappy.parquet
song_id = SOHKNRJ12A6701D1F8
title = Drop of Rain
duration = 189.57016
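
If parquet-tools is not available, Spark itself can read the Parquet output back. A minimal sketch (the output path is illustrative; use the S3_OUTPUT location configured below):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-parquet").getOrCreate()

# Read one of the Parquet outputs written by the ETL and peek at a few rows.
songs = spark.read.parquet("s3a://[bucket-name]/songs/")
songs.printSchema()
songs.show(5, truncate=False)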

Usage

Structure & configuration

The project includes one main file, etl.py. The ETL process:

  • uses Python functions to connect to AWS EMR and load staging tables from S3
  • uses schema-on-read techniques to structure the data into fact/dimension tables
  • saves these tables back to S3 as Parquet files (see the sketch below)
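
A minimal sketch of this pattern (not the exact contents of etl.py; the songs table and its columns are taken from the song data format shown below, and the output path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Load the raw JSON into a staging view (schema-on-read: the schema is
# inferred from the data when it is queried, not fixed at ingest time).
spark.read.json("s3a://[bucket-name]/song_data/*/*/*/*.json") \
    .createOrReplaceTempView("staging_songs")

# Build a dimension table with a Spark SQL query over the staging view.
songs = spark.sql("""
    SELECT DISTINCT song_id, title, artist_id, year, duration
    FROM staging_songs
    WHERE song_id IS NOT NULL
""")

# Persist the dimension back to S3 as Parquet files.
songs.write.mode("overwrite").parquet("s3a://[bucket-name]/songs/")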

To interact with AWS EMR and S3, you will need to create a configuration file dl.cfg and supply:

  • the credentials to access AWS EMR
  • S3 locations used for the datasets

Rename dl.cfg-dist to dl.cfg in the root folder and provide the following configuration:

AWS_ACCESS_KEY_ID=''
AWS_SECRET_ACCESS_KEY=''
S3_INPUT="s3a://[bucket-name]/"
S3_OUTPUT="s3a://[bucket-name]/"
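
etl.py can load these values with Python's configparser. A sketch, assuming the file stays section-less as shown above (the actual loading code in etl.py may differ):

import configparser
import os

# dl.cfg has no [section] header, so prepend a dummy one for configparser.
parser = configparser.ConfigParser()
with open("dl.cfg") as f:
    parser.read_string("[DEFAULT]\n" + f.read())

cfg = parser["DEFAULT"]
# Export the credentials so the s3a connector can pick them up.
os.environ["AWS_ACCESS_KEY_ID"] = cfg["AWS_ACCESS_KEY_ID"].strip("'")
os.environ["AWS_SECRET_ACCESS_KEY"] = cfg["AWS_SECRET_ACCESS_KEY"].strip("'")
S3_INPUT = cfg["S3_INPUT"].strip('"')
S3_OUTPUT = cfg["S3_OUTPUT"].strip('"')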

ETL-ing

Once configured, execute the following commands to run the ETL process in Spark:

scp -i ~/[ec2-keypair-file].pem etl.py hadoop@[spark-master-IP]:~
ssh -i ~/[ec2-keypair-file].pem hadoop@[spark-master-IP]
spark-submit --master yarn etl.py
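
[spark-master-IP] is the public DNS name (or IP) of the EMR master node. It can be copied from the EMR console, or looked up programmatically, for example with boto3 (a sketch; the cluster id and region are placeholders, use the values from the create-cluster step):

import boto3

# Look up the master node address of the cluster provisioned earlier.
emr = boto3.client("emr", region_name="us-west-2")  # region is an example
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")  # id from create-cluster
print(cluster["Cluster"]["MasterPublicDnsName"])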

Datasets

The project datasets are based on a subset of the Million Song Dataset.

A small subset of the data is available in the data folder.

Songs data

You will need to store the data in an S3 bucket with the following structure:

  • Song data: s3://mybucket/song_data
  • Log data: s3://mybucket/log_data
  • Log data json path: s3://mybucket/log_json_path.json

Each file is in JSON format and contains metadata about a song and the artist of that song.

The files are organised in a hierarchy of folders based on the first three letters of each song's track ID.

For example:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

A single song file has the following form:

{
    "artist_id": "ARJNIUY12298900C91",
    "artist_latitude": null,
    "artist_location": "",
    "artist_longitude": null,
    "artist_name": "Adelitas Way",
    "duration": 213.9424,
    "num_songs": 1,
    "song_id": "SOBLFFE12AF72AA5BA",
    "title": "Scream",
    "year": 2009
}

Log Dataset

The log dataset is generated by an event simulator based on the songs in the song dataset. It serves as our simulation of activity logs from users listening to songs.

The log files are partitioned by year and month. For example,

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

A single log file contains multiple JSON records (one per line). A record has the following form:

{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Walter",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Frye",
    "length": null,
    "level": "free",
    "location": "San Francisco-Oakland-Hayward, CA",
    "method": "GET",
    "page": "Home",
    "registration": 1540919166796.0,
    "sessionId": 38,
    "song": null,
    "status": 200,
    "ts": 1541105830796,
    "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
    "userId": "39"
}
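
In Spark, these log files can be read directly and filtered down to song-play events. A sketch (the column names come from the record above; filtering on page = "NextSong" to keep only song plays is an assumption about the simulator's page values):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkify-logs").getOrCreate()

# One JSON record per line is the format spark.read.json expects by default.
logs = spark.read.json("s3a://[bucket-name]/log_data/*/*/*.json")

# Keep song-play events ("NextSong" is assumed here) and turn the
# epoch-millisecond `ts` field into a proper timestamp.
songplays = (
    logs.filter(F.col("page") == "NextSong")
        .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)
songplays.select("userId", "level", "sessionId", "start_time").show(5)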

We use the log_json_path.json JSONPaths manifest to tell Redshift how to map the JSON attributes when loading the data files from S3 with a COPY command.

{
    "jsonpaths": [
        "$['artist']",
        "$['auth']",
        "$['firstName']",
        "$['gender']",
        "$['itemInSession']",
        "$['lastName']",
        "$['length']",
        "$['level']",
        "$['location']",
        "$['method']",
        "$['page']",
        "$['registration']",
        "$['sessionId']",
        "$['song']",
        "$['status']",
        "$['ts']",
        "$['userAgent']",
        "$['userId']"
    ]
}

Data model

Star schema

This is a visual representation of the tables emulated with Spark SQL:

[Schema diagram]

Spark specificities

You can learn more about Spark specificities here.

License: Apache License 2.0

