hamadalaqeel / data-lake-aws

Project 4 for Data Engineering Nanodegree

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Introduction

The purpose of this project is to build an ETL pipeline that will be able to extract song and log data from an S3 bucket, process the data using Spark and load the data back into s3 as a set of dimensional tables in spark parquet files. This helps analysts to continue finding insights on what their users are listening to.

Database Schema Design

Our Database looks like the following

Instructions

  • Add appropriate AWS IAM Credentials in dl.cfg
  • Specify desired output data path in the main function of etl.py
  • Run etl.py

About

Project 4 for Data Engineering Nanodegree


Languages

Language:Python 100.0%