Movies_ETL

Use ETL to create a database that can be analyzed to predict movie popularity.

Objective

Prepare a dataset for a hackathon using the ETL process. Gather data from both Wikipedia and Kaggle, combine them, and save them into a SQL database so that the hackathon participants have a nice, clean dataset to use.

Resources:

- Downloaded wikipedia.movies.json
- Downloaded the-movies-dataset zip folder from Kaggle
- Unzipped the folder and used 2 of the files: movies_metadata.csv and ratings.csv

Steps Taken in the ETL Process

Inspected the data to determine any problems that may need to be addressed in the transformation process.
Filled in missing data
- substituted data from another data frame that had good data
- interpolated between existing data points
- extrapolated from existing data
Normalized the data
- reshaped the data
- converted data types
- parsed text data to the correct format
- split columns using regex
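
A few of the steps above (converting data types, splitting a text column with regex, and substituting data from another data frame) can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy frames, the column names other than `runtime`, and the choice to treat a budget of 0 as missing are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy frames standing in for the Wikipedia and Kaggle data.
wiki = pd.DataFrame({
    "imdb_id": ["tt001", "tt002", "tt003"],
    "running_time": ["95 minutes", None, "1h 50m"],
    "budget_wiki": [1_000_000.0, None, 5_000_000.0],
})
kaggle = pd.DataFrame({
    "imdb_id": ["tt001", "tt002", "tt003"],
    "runtime": [95.0, 120.0, None],
    "budget_kaggle": ["1000000", "2500000", "0"],
})

merged = wiki.merge(kaggle, on="imdb_id", how="inner")

# Convert data types: the Kaggle budget arrives as text.
merged["budget_kaggle"] = pd.to_numeric(merged["budget_kaggle"], errors="coerce")

# Split a text column with regex: capture optional hours and the minutes.
parts = merged["running_time"].str.extract(r"(?:(\d+)\s*h\s*)?(\d+)\s*m")
hours = parts[0].astype(float).fillna(0)   # no hours captured means 0 hours
minutes = parts[1].astype(float)           # stays NaN when nothing matched
merged["wiki_minutes"] = hours * 60 + minutes

# Fill missing data by substituting values from the other source.
merged["runtime"] = merged["runtime"].fillna(merged["wiki_minutes"])

# Assumption: a Kaggle budget of 0 means "unknown", so mask it before filling.
kaggle_budget = merged["budget_kaggle"].mask(merged["budget_kaggle"] == 0)
merged["budget"] = kaggle_budget.fillna(merged["budget_wiki"])

print(merged[["imdb_id", "runtime", "budget"]])
```

Keeping one column per measurement after the fill (here `runtime` and `budget`) is what lets the redundant source columns be dropped before loading.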

Used during the entire transformation process:

- list comprehensions to filter data
- created functions to perform the cleaning process
- lambda functions
- regular expressions (regex)
- parsing
- histogram
- scatter plots
- connected pandas to a SQL database by creating a database engine
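
Two of the techniques listed above, list comprehensions used as filters and regular expressions used for parsing, can be sketched with the standard library alone. The records, the field names, the filtering rule, and the helper functions `parse_dollars` and `extract_imdb_id` are hypothetical and only illustrate the idea; the real data has many more fields and formats.

```python
import re

# Hypothetical raw records shaped like scraped Wikipedia entries.
raw_movies = [
    {"title": "Film A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Box office": "$1.5 million"},
    {"title": "Show B", "imdb_link": "https://www.imdb.com/title/tt0000002/",
     "No. of episodes": 12},
    {"title": "Film C", "imdb_link": "https://www.imdb.com/title/tt0000003/",
     "Box office": "$1,234,567"},
    {"title": "Film D", "Box office": "$3.5 billion"},
]

# List comprehension as a filter: keep entries that have an IMDb link
# and are not TV series (assumed here to be marked by an episode count).
movies = [m for m in raw_movies
          if "imdb_link" in m and "No. of episodes" not in m]

# Two assumed dollar formats: "$1.5 million" and "$1,234,567".
WORD_FORM = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*(million|billion)", re.IGNORECASE)
NUMBER_FORM = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+)")
SCALE = {"million": 10**6, "billion": 10**9}


def parse_dollars(text):
    """Return a dollar amount as a float, or None if no known form matches."""
    if not isinstance(text, str):
        return None
    match = WORD_FORM.search(text)
    if match:
        return float(match.group(1)) * SCALE[match.group(2).lower()]
    match = NUMBER_FORM.search(text)
    if match:
        return float(match.group(1).replace(",", ""))
    return None


def extract_imdb_id(link):
    """Pull an ID such as tt0000001 out of an IMDb URL, or return None."""
    match = re.search(r"tt\d{7}", link)
    return match.group(0) if match else None


cleaned = [
    {
        "imdb_id": extract_imdb_id(m["imdb_link"]),
        "title": m["title"],
        "box_office": parse_dollars(m.get("Box office")),
    }
    for m in movies
]

for row in cleaned:
    print(row)
```

Returning `None` for unparseable values, rather than raising, keeps the unmatched rows visible so they can be inspected and handled in a later pass.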

Images generated to compare similar values between the data sets:

Histogram:

Scatter Plots:

Release Dates

wiki_box_office vs kaggle_revenue

wiki_budget vs kaggle_budget

wiki_running_time vs kaggle_runtime

![alt text](https://github.com/Al-Huneidi/Movies_ETL/blob/master/Screenshots/wiki_running_time%7Ckagggle_runtime.png)

Verification that the data was copied to the movies table in the SQL database:
