Building an ETL pipeline with Python.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
To run the pipeline and preview the dataset, you will need:
Python 3.6 or later
Amazon Redshift
psycopg2
To create the tables and load the data from the JSON files into the Redshift data warehouse, run the following commands:
python create_tables.py
python etl.py
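Under the hood, both scripts follow the same pattern: read the cluster connection settings, open a psycopg2 connection, and execute the statements defined in subqueries.py. Below is a minimal sketch of that pattern, assuming a dwh.cfg file whose [CLUSTER] section lists host, dbname, user, password and port in that order; the config file layout and the create_table_queries name are illustrative assumptions, not the project's exact code.
# connect to Redshift and run the CREATE TABLE statements -- illustrative sketch
import configparser
import psycopg2
from subqueries import create_table_queries  # hypothetical list of CREATE statements

config = configparser.ConfigParser()
config.read('dwh.cfg')

conn = psycopg2.connect(
    'host={} dbname={} user={} password={} port={}'.format(*config['CLUSTER'].values())
)
cur = conn.cursor()

for query in create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()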
To test the pipeline, run:
jupyter notebook
# then open test.ipynb and run its cells
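A typical check inside the notebook is a simple row count per table to confirm the load succeeded. The cell below is illustrative and reuses the same dwh.cfg connection settings assumed in the sketch above.
# count the rows in each table -- illustrative sanity check
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read('dwh.cfg')
conn = psycopg2.connect(
    'host={} dbname={} user={} password={} port={}'.format(*config['CLUSTER'].values())
)
cur = conn.cursor()
for table in ['songplays', 'users', 'songs', 'artists', 'time']:
    cur.execute('SELECT COUNT(*) FROM {}'.format(table))
    print(table, cur.fetchone()[0])
conn.close()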
- subqueries.py: contains all the ETL pipeline queries; an illustrative example follows
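As an illustration of the kind of statement kept in subqueries.py, a CREATE TABLE statement for the fact table might look like the following; the actual column definitions in the project may differ.
# illustrative example of a statement kept in subqueries.py
songplay_table_create = '''
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
'''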
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
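Each song file holds a single JSON object. A quick way to inspect one is shown below; the field names used (song_id, title, artist_name) are typical of the Million Song Dataset subset and appear here only for illustration.
# load one song file and print a few fields -- illustrative
import json

with open('song_data/A/B/C/TRABCEI128F424C983.json') as f:
    song = json.load(f)
print(song.get('song_id'), song.get('title'), song.get('artist_name'))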
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
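Assuming each line of a log file is a separate JSON event (the usual layout for these simulator logs), one day of events can be read as shown below; the page field is used only as an example.
# read one day of events -- illustrative
import json

with open('log_data/2018/11/2018-11-12-events.json') as f:
    events = [json.loads(line) for line in f]
print(len(events), 'events; first page:', events[0].get('page'))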
The design adopts a star schema with dimension tables (users, songs, artists, time) and a fact table (songplays). Denormalizing the tables in this way simplifies analytical queries and reduces the number of joins. The database is designed to answer questions about user activity, such as which songs users play and which artists they listen to; an illustrative query follows the table list below.
- songplays: Records in log data associated with song plays
- users: Users in the app
- songs: Songs in the database
- artists: Artists in the database
- time: Timestamps of events broken down into specific units
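As an example of the kind of question the star schema makes cheap to answer, a single join against the fact table returns the most-played songs. The query below is illustrative and not part of the project scripts; run it with a psycopg2 cursor set up as in the earlier sketches.
# top ten most-played songs -- illustrative analytical query
top_songs_sql = '''
SELECT s.title, COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title
ORDER BY plays DESC
LIMIT 10;
'''
cur.execute(top_songs_sql)
for title, plays in cur.fetchall():
    print(title, plays)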
- Victor A.