etl etl-pipeline json jupyter-notebook pgadmin postgresql python

Michelle Werner (6/1/2022)

ETL: Extract, Transform, and Load

The ETL process aims to produce consistent, robust data storage whose relationships preserve data integrity and support efficient querying. In this project, data was extracted from two different web sources, cleaned, and then loaded into a new database. Cleaning was by far the most intensive step.

(Pictured: "ETL: Extract Transform and Load")

About the project:

Amazing Prime Video is the world's largest online retailer, and it also runs a platform that streams movies and TV shows. It has sponsored a hackathon to see whether data nerds can find a way to loop through video data and identify low-budget films that will be in demand, so it can purchase the streaming rights on the cheap.

Data sources supplied include:

Wikipedia info on all movies released since 1990 (in JSON format)
MovieLens rating data from Kaggle (in multiple CSV files)

MovieLens is a website run by the GroupLens research group at the University of Minnesota. The Kaggle dataset pulls from the MovieLens dataset of over 20 million reviews and contains a metadata file with details about the movies from The Movie Database (TMDb). Wikipedia has a ton of information about movies, including budgets and box office returns, cast and crew, production and distribution, and so much more. Amazing Prime Hackathon organizer Britta supplied these data sources.
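As a rough illustration of the extract step for these two formats, here is a minimal sketch using only the Python standard library. The inline sample strings, the field names (title, year, userId, movieId, rating), and the helper names extract_json and extract_csv are all assumptions made up for this example; they are not taken from the notebook in this repository, which may well use different tools and column names.

```python
import csv
import io
import json

# Hypothetical stand-ins for the real files; every value here is invented.
WIKI_JSON = (
    '[{"title": "Example Film", "year": 1995},'
    ' {"title": "Another Film", "year": 2001}]'
)
RATINGS_CSV = "userId,movieId,rating\n1,10,4.0\n2,10,3.5\n"


def extract_json(handle):
    """Load a JSON array of movie records into a list of dicts."""
    return json.load(handle)


def extract_csv(handle):
    """Read CSV rows into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


# io.StringIO lets the sketch run without any files on disk;
# with real data these would be open(...) handles instead.
wiki_movies = extract_json(io.StringIO(WIKI_JSON))
ratings = extract_csv(io.StringIO(RATINGS_CSV))

print(len(wiki_movies), "Wikipedia records;", len(ratings), "rating rows")
```

Note that csv.DictReader returns every value as a string, so numeric columns such as the rating still need converting during the transform step.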

(Pictured: Module 8 Hackathon Graphic)

The first task was to inspect them all and get going in Jupyter Notebook, with the eventual goal of exporting clean data to a new PostgreSQL database.
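To give a feel for what "clean, then export" can look like, here is a small sketch of a transform-and-load step. It rests on several assumptions: the budget strings, the sample records, and the helpers parse_dollars and load_movies are hypothetical, and sqlite3 stands in for PostgreSQL purely so the example runs without a database server. The actual project loads into PostgreSQL, and its cleaning rules are likely more extensive.

```python
import re
import sqlite3


def parse_dollars(text):
    """Turn strings like '$5 million' or '$1,200,000' into a float, else None."""
    if not isinstance(text, str):
        return None
    match = re.fullmatch(
        r"\$\s*([\d,.]+)\s*(million|billion)?", text.strip(), flags=re.IGNORECASE
    )
    if match is None:
        return None
    try:
        value = float(match.group(1).replace(",", ""))
    except ValueError:
        return None  # e.g. a stray run of punctuation
    scale = {"million": 1e6, "billion": 1e9}.get((match.group(2) or "").lower(), 1.0)
    return value * scale


def load_movies(conn, movies):
    """Create the table if needed and insert one cleaned row per movie."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies (title TEXT, year INTEGER, budget REAL)"
    )
    rows = [
        (m.get("title"), m.get("year"), parse_dollars(m.get("budget")))
        for m in movies
    ]
    conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Invented sample records, not real data from Wikipedia or Kaggle.
sample = [
    {"title": "Example Film", "year": 1995, "budget": "$5 million"},
    {"title": "Another Film", "year": 2001, "budget": "$1,200,000"},
    {"title": "No Budget Listed", "year": 2010},
]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
loaded = load_movies(conn, sample)
cheap = conn.execute(
    "SELECT title FROM movies WHERE budget IS NOT NULL AND budget < 2000000 "
    "ORDER BY title"
).fetchall()
print(loaded, "rows loaded; low-budget titles:", cheap)
```

Storing the budget as a number rather than raw text is what makes a query for low-budget films possible once the data is in the database.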

Check out this repository to see how it went!

About

ETL exercise on combining and cleaning movie data from Wikipedia and Kaggle into PostgreSQL using Python and SQL.
