A project from the Data Engineer Nanodegree Program at Udacity to practice data pipelines using Apache Airflow as well as S3 buckets and Amazon Redshift.
A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has come to the conclusion that the best tool to achieve this is Apache Airflow.
They have decided to bring you into the project and expect you to create high-grade data pipelines that are dynamic and built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies.
The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.
The goal of this project is to apply what I have learned about Apache Airflow to create my own custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step. The final Airflow DAG should look similar to the image below.
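As a rough illustration of the building blocks involved, here is a minimal sketch of a custom operator wired into a DAG. It assumes Airflow 2.x with the Postgres provider installed; the operator, DAG, task, and connection names are placeholders, not the project's actual code.

```python
# Minimal sketch, assuming Airflow 2.x and the Postgres provider.
# Names (DataQualityOperator, "sparkify_etl", "redshift") are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models.baseoperator import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if a COUNT(*) query returns no rows."""

    def __init__(self, redshift_conn_id: str, sql: str, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.sql = sql

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        records = hook.get_records(self.sql)
        if not records or not records[0] or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.sql}")
        self.log.info("Data quality check passed with %s rows", records[0][0])


default_args = {
    "owner": "sparkify",
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    "sparkify_etl",
    default_args=default_args,
    start_date=datetime(2018, 11, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    check_users = DataQualityOperator(
        task_id="check_users_table",
        redshift_conn_id="redshift",
        sql="SELECT COUNT(*) FROM users",
    )
```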
Data is stored in an S3 bucket in `us-west-2`:
- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
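To take a quick look at what these prefixes contain, something like the following can be used (not part of the project code; it assumes boto3 is installed and AWS credentials are already configured):

```python
# Quick peek at the source data; assumes boto3 and configured AWS credentials.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
for prefix in ("song_data/", "log_data/"):
    response = s3.list_objects_v2(Bucket="udacity-dend", Prefix=prefix, MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])
```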
A subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. Files are located under `s3://udacity-dend/song_data`.
For example, this is what the first file (`TRAAAAW128F429D538.json`) looks like:
```json
{
    "num_songs": 1,
    "artist_id": "ARD7TVE1187B99BFB1",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "California - LA",
    "artist_name": "Casual",
    "song_id": "SOMZWCG12A8C13C480",
    "title": "I Didn't Mean To",
    "duration": 218.93179,
    "year": 0
}
```
The log dataset is composed of log files in NDJSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. Files are located under `s3://udacity-dend/log_data`.
The log files are named following a date pattern (`{year}-{month}-{day}-events.json`), and below is the first line of the first file (`2018-11-01-events.json`) as an example:
```json
{
    "artist": null,
    "auth": "Logged In",
    "firstName": "Walter",
    "gender": "M",
    "itemInSession": 0,
    "lastName": "Frye",
    "length": null,
    "level": "free",
    "location": "San Francisco-Oakland-Hayward, CA",
    "method": "GET",
    "page": "Home",
    "registration": 1540919166796.0,
    "sessionId": 38,
    "song": null,
    "status": 200,
    "ts": 1541105830796,
    "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
    "userId": "39"
}
```
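The staging step mentioned in the project goal typically boils down to a Redshift COPY from these S3 paths. As a hedged illustration (the table name, IAM role ARN, and JSON option are placeholders, not the project's actual SQL):

```python
# Hedged sketch of the COPY a staging operator might run for the NDJSON logs.
# Table name, IAM role ARN, and JSON option are placeholders.
STAGE_EVENTS_COPY = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    CREDENTIALS 'aws_iam_role=<REDSHIFT_IAM_ROLE_ARN>'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
# Note: a jsonpaths file may be needed in place of 'auto' if the JSON keys do
# not match the staging table's column names.
```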
The new tables follow a star schema that is better visualized through an Entity Relationship Diagram (ERD). The following image shows the ERD of an example PostgreSQL database with the final tables defined.
Here, `songplays` is a fact table, whereas `artists`, `songs`, `time`, and `users` are dimension tables. These tables make it easy to query relevant information with few joins.
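To give a rough idea of what the fact table looks like in SQL (a sketch only: column types and constraints are assumptions and may differ from the DDL actually used in the project):

```python
# Plausible DDL for the songplays fact table; types and constraints are assumptions.
SONGPLAYS_TABLE_CREATE = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL REFERENCES time (start_time),
        user_id     INT       NOT NULL REFERENCES users (user_id),
        level       VARCHAR,
        song_id     VARCHAR REFERENCES songs (song_id),
        artist_id   VARCHAR REFERENCES artists (artist_id),
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```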
To make use of this project, I recommend managing the required dependencies with Anaconda.
Install miniconda:
```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Install mamba:
```bash
conda install -n base -c conda-forge mamba
```
Install the environment using the provided file:
```bash
mamba env create -f environment.yml  # alternatively use environment_core.yml if base system is not debian
mamba activate sparkify_airflow
```
To start a local Apache Airflow server for the purposes of this project, simply run the following:
```bash
bash initialize_airflow.sh
```
Enter your desired password when prompted and then access the UI at `localhost:8080` with user `admin` and the password you just created.
Create an IAM user:
- IAM service is a global service, meaning newly created IAM users are not restricted to a specific region by default.
- Go to AWS IAM service and click on the "Add user" button to create a new IAM user in your AWS account.
- Choose a name of your choice.
- Select "Programmatic access" as the access type. Click Next.
- Choose the "Attach existing policies directly" tab and select the "AdministratorAccess" policy. Click Next.
- Skip adding any tags. Click Next.
- Review and create the user. It will show you an access key ID and a secret access key.
- Take note of this access key ID and secret access key; together they are known as an access key.
Save the access key and secret locally:
- Create a new file, `_user.cfg`, and add the following:
  ```
  AWS_ACCESS_KEY_ID = <YOUR_AWS_KEY>
  AWS_SECRET_ACCESS_KEY = <YOUR_AWS_SECRET>
  ```
- This file will be loaded internally to connect to AWS and perform various operations (see the sketch after this list).
- DO NOT SHARE THIS FILE WITH ANYONE! I recommend adding this file to `.gitignore` to avoid accidentally pushing it to a git repository:
  ```bash
  printf "\n_user.cfg\n" >> .gitignore
  ```
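For reference, here is one plausible way such a file could be read from Python (a sketch only, not necessarily how the project's own loader works):

```python
# Hedged sketch: read sectionless KEY = VALUE pairs from _user.cfg.
import configparser

parser = configparser.ConfigParser()
with open("_user.cfg") as f:
    # Prepend a dummy section header so configparser accepts the file.
    parser.read_string("[credentials]\n" + f.read())

aws_key = parser["credentials"]["AWS_ACCESS_KEY_ID"]
aws_secret = parser["credentials"]["AWS_SECRET_ACCESS_KEY"]
```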
Create the cluster:
- Fill in the `dwh.cfg` configuration file. These are the basic parameters that will be used to operate on AWS. More concretely, `GENERAL` covers general parameters, `DWH` includes the information necessary to create and connect to the Redshift cluster, and `S3` contains information on where to find the source dataset for this project. This file is already filled with example values.
- To create the Redshift cluster, simply run the `setup.py` Python script (this must be done after `initialize_airflow.sh`, since the registration of Airflow connections also takes place in `setup.py`). A rough sketch of what this step involves is shown below.
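In outline, cluster creation and connection registration could look something like this (a hedged sketch: the real script reads its parameters from `dwh.cfg` and `_user.cfg`, all values here are placeholders, and the exact calls in `setup.py` may differ):

```python
# Hedged sketch of creating a Redshift cluster and registering an Airflow
# connection; all values are placeholders. Assumes boto3 and Airflow 2.x.
import boto3
from airflow.models import Connection
from airflow.utils.session import create_session

redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="awsuser",
    MasterUserPassword="<PASSWORD>",
    IamRoles=["<REDSHIFT_IAM_ROLE_ARN>"],
)

# Once the cluster endpoint is available, expose it to Airflow as a connection.
redshift_conn = Connection(
    conn_id="redshift",
    conn_type="postgres",
    host="<CLUSTER_ENDPOINT>",
    schema="sparkify",
    login="awsuser",
    password="<PASSWORD>",
    port=5439,
)
with create_session() as session:
    session.add(redshift_conn)
```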
DO NOT FORGET TO TERMINATE YOUR REDSHIFT CLUSTER WHEN FINISHED WORKING ON THE PROJECT TO AVOID UNWANTED COSTS!
Project structure:
- `src/dags`: Airflow DAGs.
- `src/plugins`: Airflow custom plugins and operators.
- `src/*.py`: Utility scripts and functions.
Ensure you have set the `PYTHONPATH` environment variable as needed (e.g., `PYTHONPATH=~/sparkify_airflow/src`).
The whole project can be run as follows:
```bash
bash initialize_airflow.sh && python src/setup.py
```
Source files are formatted using the following commands:
```bash
isort .
autoflake -r --in-place --remove-unused-variables --remove-all-unused-imports --ignore-init-module-imports .
black .
```
Distributed under the MIT License. See `LICENSE` for more information.
GitHub - Google Scholar - LinkedIn - Twitter
This README includes a summary of the official project description provided to the students of the Data Engineer Nanodegree Program at Udacity.