ETL on Cloud Data Lake for Song Play Analysis

This project loads raw song and user data, processes it on the Elastic MapReduce (EMR) service, and saves the result as a star schema for later analysis. It also satisfies the Data Lake project requirement of the Data Engineering Nanodegree Program.

Prerequisites

  • Python 3
  • Python virtual environment (venv)
  • AWS credentials/config files under the ~/.aws directory

Steps

  1. Bootstrap virtual environment with dependencies
    $ python3 -m venv ./venv
    $ source ./venv/bin/activate
    $ pip install -r requirements.txt
  2. Copy the config templates: template.dl.cfg to dl.cfg and aws_stuff/template.common.sh to aws_stuff/common.sh.
    $ cp ./template.dl.cfg ./dl.cfg
    $ cp ./aws_stuff/template.common.sh ./aws_stuff/common.sh
  3. Fill in the ETL_PROCESSED_DATA_SET section of dl.cfg. It refers to the target S3 bucket (and key prefixes) used to store the processed data set. Here are possible values.
    [ETL_PROCESSED_DATA_SET]
    BUCKET_NAME=sample-data-lake-bucket
    USER_DATA_PREFIX=data-lake/user
    ARTIST_DATA_PREFIX=data-lake/artist
    TIME_DATA_PREFIX=data-lake/time
    SONG_DATA_PREFIX=data-lake/song
    SONGPLAY_DATA_PREFIX=data-lake/songplay
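    If the target bucket does not exist yet, one way to create it up front is with the AWS CLI; the bucket name below matches the sample value above.
    $ aws s3 mb s3://sample-data-lake-bucket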
  4. Fill in cluster_name, key_pair_file_name, subnet_id, log_uri, and pem_file_path in aws_stuff/common.sh. Here are possible values.
    # any name of your choice
    cluster_name="tony-emr-cluster"
    
    # any S3 location of your choice for storing EMR logs
    log_uri="s3://sample-emr-cluster-log/"
    
    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html
    key_pair_file_name="sample-ec2-emr-key-pair"
    pem_file_path="${HOME}/.aws/sample-ec2-emr-key-pair.pem"
    
    # default subnet ID
    # https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-vpc
    subnet_id="sample-subnet-id"
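    If you do not know your default subnet ID offhand, one way to look it up is with the AWS CLI; the filter below assumes you want a default subnet of your default VPC.
    $ aws ec2 describe-subnets --filters Name=default-for-az,Values=true --query 'Subnets[0].SubnetId' --output text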
  5. Spin up an EMR cluster.
    $ cd ./aws_stuff
    $ ./create_emr_cluster.sh
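    The script itself is not reproduced here, but a cluster matching these settings can be created with a single aws emr create-cluster call along the lines below, assuming common.sh has been sourced. The release label and instance sizing are illustrative assumptions, not values taken from the script.
    $ aws emr create-cluster \
        --name "${cluster_name}" \
        --release-label emr-5.33.0 \
        --applications Name=Spark \
        --instance-type m5.xlarge \
        --instance-count 3 \
        --use-default-roles \
        --ec2-attributes KeyName="${key_pair_file_name}",SubnetId="${subnet_id}" \
        --log-uri "${log_uri}"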
  6. Look for the cluster ID in the output of the previous step. Then, put it in aws_stuff/common.sh. Here is a possible value.
    cluster_id="sample-cluster-id"
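    Besides scanning the script output, the ID can also be fetched with the AWS CLI; the --query expression below assumes the sample cluster name used earlier.
    $ aws emr list-clusters --active --query "Clusters[?Name=='tony-emr-cluster'].Id" --output text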
  7. Retrieve the public DNS name of the master node from the EMR console in AWS. Then, put it in aws_stuff/common.sh. Here is a possible value.
    master_public_dns="sample-master-node.compute.amazonaws.com"
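    As an alternative to the console, the DNS name can be queried with the AWS CLI once cluster_id is set (again assuming common.sh has been sourced).
    $ aws emr describe-cluster --cluster-id "${cluster_id}" --query Cluster.MasterPublicDnsName --output text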
  8. Upload etl.py and dl.cfg to the master node.
    $ cd ./aws_stuff
    $ ./upload_etl_stuff.sh
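    The upload presumably boils down to an scp call roughly like the one below; hadoop is the default login user on EMR master nodes, while the relative paths and target directory are assumptions.
    $ scp -i "${pem_file_path}" ../etl.py ../dl.cfg "hadoop@${master_public_dns}:~/"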
  9. SSH to the master node and submit the etl.py script via the spark-submit command.
    $ spark-submit --master yarn etl.py
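    For reference, the SSH session for this step can be opened with the key pair configured in common.sh; hadoop is the default login user on EMR master nodes.
    $ ssh -i "${pem_file_path}" "hadoop@${master_public_dns}"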
  10. Terminate the EMR cluster after use.
    $ cd ./aws_stuff
    $ ./terminate_emr_cluster.sh
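    The teardown script is expected to wrap the corresponding AWS CLI call, which looks roughly like this:
    $ aws emr terminate-clusters --cluster-ids "${cluster_id}"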
