CarlosUziel / sparkify-spark

A project from the Data Engineer Nanodegree Program at Udacity to practice Data Lakes and ETL pipelines using AWS and Apache Spark.

Repository on GitHub: https://github.com/CarlosUziel/sparkify-spark

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Additional Notes
  5. License
  6. Contact
  7. Acknowledgments

Sparkify (using Spark and AWS EMR)

A project from the Data Engineer Nanodegree Program at Udacity to practice data lakes on the cloud using Spark and AWS services.

About The Project

Premise

A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. Its data resides in S3: a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata about the songs in the app.

As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow their analytics team to continue finding insights into what songs their users are listening to.

You'll be able to test your database and ETL pipeline by running queries given to you by Sparkify's analytics team and comparing your results with their expected results.

(back to top)

Goal

The goal of this project is to apply what I have learned about Spark and data lakes to build an ETL pipeline for a data lake hosted on S3. Data will be loaded from an S3 bucket, processed into fact and dimension tables, and stored in another S3 bucket.

(back to top)

Data

Data is stored in an S3 bucket:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

Log data JSON path: s3://udacity-dend/log_json_path.json
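
For orientation, here is a minimal PySpark sketch of how these inputs could be read. The actual loading logic lives in src/etl.py; the wildcards assume the nested directory layout of the Udacity bucket, and the s3:// scheme assumes the code runs on EMR (outside EMR, s3a:// is usually needed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify").getOrCreate()

# Song files are nested several directory levels deep, hence the wildcards.
song_df = spark.read.json("s3://udacity-dend/song_data/*/*/*/*.json")

# Log files are organized by year and month.
log_df = spark.read.json("s3://udacity-dend/log_data/*/*/*.json")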

The song dataset

A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. Files are located under s3://udacity-dend/song_data.

For example, this is what the first file (TRAAAAW128F429D538.json) looks like:

{
  "num_songs": 1,
  "artist_id": "ARD7TVE1187B99BFB1",
  "artist_latitude": null,
  "artist_longitude": null,
  "artist_location": "California - LA",
  "artist_name": "Casual",
  "song_id": "SOMZWCG12A8C13C480",
  "title": "I Didn't Mean To",
  "duration": 218.93179,
  "year": 0
}
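
Continuing the sketch above, the songs and artists dimension tables could be derived from these fields roughly as follows (the column selection is one plausible choice, not necessarily the exact one used in src/etl.py):

from pyspark.sql import functions as F

# One row per song, keeping only song-level attributes.
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# One row per artist, renaming the artist_* columns.
artists_table = song_df.select(
    "artist_id",
    F.col("artist_name").alias("name"),
    F.col("artist_location").alias("location"),
    F.col("artist_latitude").alias("latitude"),
    F.col("artist_longitude").alias("longitude"),
).dropDuplicates(["artist_id"])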

Log dataset

It is composed of log files in NDJSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. Files are located under s3://udacity-dend/log_data.

The log files are named following a date pattern ({year}-{month}-{day}-events.json), and below is the first line of the first file (2018-11-01-events.json) as an example:

{
  "artist": null,
  "auth": "Logged In",
  "firstName": "Walter",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Frye",
  "length": null,
  "level": "free",
  "location": "San Francisco-Oakland-Hayward, CA",
  "method": "GET",
  "page": "Home",
  "registration": 1540919166796.0,
  "sessionId": 38,
  "song": null,
  "status": 200,
  "ts": 1541105830796,
  "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
  "userId": "39"
}
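
In this dataset, rows whose page field is "NextSong" mark actual song plays, and ts holds epoch milliseconds. Continuing the sketch above, the users dimension table and a usable timestamp could be derived roughly like this (an illustration, not necessarily the exact code in src/etl.py):

# Keep only song-play events and convert epoch milliseconds to a timestamp.
songplay_events = log_df.filter(F.col("page") == "NextSong").withColumn(
    "start_time", (F.col("ts") / 1000).cast("timestamp")
)

# One row per user, with snake_case column names.
users_table = songplay_events.select(
    F.col("userId").alias("user_id"),
    F.col("firstName").alias("first_name"),
    F.col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])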

(back to top)

Data Schema

The new tables follow a star schema that is best visualized through an Entity Relationship Diagram (ERD). The following image shows the ERD of an example PostgreSQL database with the final tables defined.

Sparkify ERD

Here, songplays is a fact table, whereas artists, songs, time and users are dimension tables. These tables make it easy to query relevant information with few joins.
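
Continuing the sketch, the songplays fact table can be built by joining the filtered log events with the song metadata and writing the result back to S3 partitioned by year and month. The output bucket below is a placeholder, and the exact columns may differ from src/etl.py:

# Match log events to songs by title and artist name; unmatched plays keep null IDs.
songplays_table = (
    songplay_events.join(
        song_df,
        (songplay_events.song == song_df.title)
        & (songplay_events.artist == song_df.artist_name),
        "left",
    )
    .withColumn("songplay_id", F.monotonically_increasing_id())
    .select(
        "songplay_id",
        "start_time",
        F.col("userId").alias("user_id"),
        "level",
        "song_id",
        "artist_id",
        F.col("sessionId").alias("session_id"),
        "location",
        F.col("userAgent").alias("user_agent"),
    )
)

# Write the fact table partitioned by year and month of the song play.
(
    songplays_table
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
    .write.mode("overwrite")
    .partitionBy("year", "month")
    .parquet("s3://<OUTPUT_BUCKET>/songplays/")
)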

(back to top)

Getting Started

To make use of this project, I recommend managing the required dependencies with Anaconda.

Setting up a conda environment

Install miniconda:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Install mamba:

conda install -n base -c conda-forge mamba

Install environment using provided file:

mamba env create -f environment.yml # alternatively use environment_core.yml if base system is not debian
mamba activate sparkify_spark

Setting up an Amazon EMR cluster

Create an IAM user:

  1. IAM service is a global service, meaning newly created IAM users are not restricted to a specific region by default.
  2. Go to AWS IAM service and click on the "Add user" button to create a new IAM user in your AWS account.
  3. Choose a name of your choice.
  4. Select "Programmatic access" as the access type. Click Next.
  5. Choose the "Attach existing policies directly" tab and select the "AdministratorAccess" policy (only for testing purposes!). Click Next.
  6. Skip adding any tags. Click Next.
  7. Review and create the user. It will show you a pair of access key ID and secret.
  8. Take note of the pair of access key ID and secret. This pair is collectively known as Access key.

Save access key and secret locally:

  1. Create a new file, _user.cfg, and add the following:

AWS_ACCESS_KEY_ID = <YOUR_AWS_KEY>
AWS_SECRET_ACCESS_KEY = <YOUR_AWS_SECRET>

  2. This file will be loaded internally to connect to AWS and perform various operations (a small loading sketch follows this list).
  3. DO NOT SHARE THIS FILE WITH ANYONE! I recommend adding this file to .gitignore to avoid accidentally pushing it to a git repository: printf "\n_user.cfg\n" >> .gitignore.
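
As a reference, here is one way such a file could be loaded from Python (the project may load it differently; the "default" section name is injected only because the file has no section header):

import configparser
import os

config = configparser.ConfigParser()

# _user.cfg has no [section] header, so prepend a dummy one before parsing.
with open("_user.cfg") as f:
    config.read_string("[default]\n" + f.read())

os.environ["AWS_ACCESS_KEY_ID"] = config["default"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["default"]["AWS_SECRET_ACCESS_KEY"]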

Create cluster:

Since the Python scripts in this project are meant to be run within an EMR cluster, the easiest way is to create the EMR cluster using the AWS UI. Alternatively, check this great guide on how to do it using boto3 (a minimal sketch is also shown after the settings below).

Create cluster with the following settings (using advanced options):

  • Release: emr-6.8.0 or later.
  • Applications: Hadoop 3.2.1, JupyterEnterpriseGateway 2.1.0, and Spark 3.3.0 (or later versions).
  • Instance type: m3.xlarge.
  • Number of instances: 3.
  • EC2 key pair: Proceed without an EC2 key pair or feel free to use one if you'd like to.

It is a good idea to set cluster auto-termination on (e.g. to one hour after being idle).
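
For reference, a minimal boto3 sketch mirroring the settings above might look like the following (the region, cluster name, and IAM roles are placeholder assumptions, not values prescribed by the project):

import boto3

emr = boto3.client("emr", region_name="us-west-2")  # pick your region

response = emr.run_job_flow(
    Name="sparkify-emr",  # hypothetical cluster name
    ReleaseLabel="emr-6.8.0",
    Applications=[
        {"Name": "Hadoop"},
        {"Name": "Spark"},
        {"Name": "JupyterEnterpriseGateway"},
    ],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    AutoTerminationPolicy={"IdleTimeout": 3600},  # auto-terminate after 1h idle
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])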

DO NOT FORGET TO TERMINATE YOUR CLUSTER WHEN FINISHED WORKING ON THE PROJECT TO AVOID UNWANTED COSTS!

Usage

Project structure:

  • notebooks: contains the main Jupyter notebook to run the project (notebooks/main.ipynb).
  • src: contains the source files and scripts to build and populate the Data Lake.

Ensure you have set the PYTHONPATH environment variable so that the modules under src can be imported (e.g., PYTHONPATH=~/sparkify_spark/src).

The whole project can be run as follows:

python src/etl.py
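
If you prefer Spark's standard launcher on the EMR master node, the same script can also be submitted with spark-submit (a generic option, not something the project mandates):

spark-submit src/etl.py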

Alternatively, follow along with notebooks/main.ipynb.

(back to top)

Additional Notes

Source files were formatted using the following commands:

isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black .

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

Carlos Uziel Pérez Malla

GitHub - Google Scholar - LinkedIn - Twitter

(back to top)

Acknowledgments

This README includes a summary of the official project description provided to the students of the Data Engineer Nanodegree Program at Udacity.

(back to top)
