Use ETL to create a database that can be analyzed to predict movie popularity.
Prepare a dataset for a hackathon using the ETL process. Gather data from both Wikipedia and Kaggle, combine them, and save them into a SQL database so that the hackathon participants have a nice, clean dataset to use.
Downloaded wikipedia.movies.json.
Downloaded the-movies-dataset zip folder from Kaggle, unzipped it, and used two of the files: movies_metadata.csv and ratings.csv.

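A minimal sketch of the extract step, assuming pandas is installed and the three files sit in one local folder. The `load_sources` helper and the folder layout are hypothetical, and the Wikipedia file is assumed to hold a JSON list of per-movie records. Small stand-in files with made-up rows are written first so the sketch runs without the real downloads.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_sources(file_dir):
    """Load the Wikipedia JSON and the two Kaggle CSV files (hypothetical helper)."""
    file_dir = Path(file_dir)
    # Assumption: the Wikipedia file is a JSON list of per-movie dicts.
    with open(file_dir / "wikipedia.movies.json", encoding="utf-8") as f:
        wiki_movies_raw = json.load(f)
    kaggle_metadata = pd.read_csv(file_dir / "movies_metadata.csv", low_memory=False)
    ratings = pd.read_csv(file_dir / "ratings.csv")
    return wiki_movies_raw, kaggle_metadata, ratings


# Tiny stand-in files (made-up rows) so the sketch runs without the downloads.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "wikipedia.movies.json").write_text(
    json.dumps([{"title": "Movie A", "year": 1999}]), encoding="utf-8"
)
pd.DataFrame({"id": [1], "title": ["Movie A"]}).to_csv(
    demo_dir / "movies_metadata.csv", index=False
)
pd.DataFrame({"userId": [7], "movieId": [1], "rating": [4.5]}).to_csv(
    demo_dir / "ratings.csv", index=False
)

wiki_movies_raw, kaggle_metadata, ratings = load_sources(demo_dir)
wiki_movies_df = pd.DataFrame(wiki_movies_raw)
```

With the real downloads, `load_sources` would be pointed at the folder that holds the unzipped files instead of `demo_dir`.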
Inspected the data to determine any problems that need to be addressed in the transformation process.

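One way to run this kind of inspection is to tabulate the dtype and missing-value count of every column. The sketch below assumes pandas; `summarize_problems` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd

# Made-up stand-in rows; the real frames come from the downloaded files.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget": ["1000000", None, "250000"],
    "runtime": [95.0, None, None],
})


def summarize_problems(frame):
    """Return per-column dtype and missing-value counts (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_count": frame.isnull().sum(),
        "null_share": frame.isnull().mean(),
    })


report = summarize_problems(df)
print(report)
```

Columns with a high `null_share` or an unexpected dtype (for example, a numeric budget stored as text) are candidates for the transformation step.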
Filled in missing data:
- substituted data from another DataFrame that had good data
- interpolated between existing data points
- extrapolated from existing data

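The three fill strategies above can be sketched as follows, assuming pandas. The rows are made up, the `runtime` and `yearly_count` names are hypothetical, and carrying the last value forward is only one simple form of extrapolation, not necessarily the one used in this project.

```python
import pandas as pd

# Made-up rows; column names mirror the wiki/Kaggle pairs compared in this README.
movies = pd.DataFrame({
    "kaggle_runtime": [95.0, None, 120.0],
    "wiki_running_time": [95.0, 101.0, None],
})

# Substitute from the other source wherever the preferred column is missing.
movies["runtime"] = movies["kaggle_runtime"].fillna(movies["wiki_running_time"])

# Interpolate linearly, filling only gaps that have known values on both sides.
yearly_count = pd.Series([10.0, None, 14.0, None], index=[2000, 2001, 2002, 2003])
interpolated = yearly_count.interpolate(method="linear", limit_area="inside")

# Extrapolate past the last known point by carrying that value forward.
extrapolated = interpolated.ffill()
```

Substituting from the second source is usually the safest of the three, because the filled value is an observed one rather than an estimate.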
Normalized the data:
- reshaped the data
- converted data types
- parsed text data to the correct format
- split columns using regex
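The normalization steps above can be sketched as follows, assuming pandas. The column names, the sample strings, and the `parse_dollars` helper are hypothetical; the real Wikipedia text has more variants than these patterns cover.

```python
import re

import pandas as pd

# Made-up rows illustrating text that needs parsing.
df = pd.DataFrame({
    "box_office": ["$21.4 million", "$1.2 billion", "$950,000"],
    "budget": ["1000000", "250000", "not available"],
    "release": ["1999-03-31", "2004-07-16", "2010-11-05"],
    "running_time": ["1 h 35 min", "2 h 5 min", "101 min"],
})


def parse_dollars(text):
    """Convert strings like '$21.4 million' or '$950,000' to a float."""
    match = re.match(r"\$\s*([\d,.]+)\s*(million|billion)?", text, flags=re.IGNORECASE)
    if not match:
        return float("nan")
    value = float(match.group(1).replace(",", ""))
    scale = {"million": 1e6, "billion": 1e9}.get((match.group(2) or "").lower(), 1)
    return value * scale


# Parse text data to the correct format.
df["box_office"] = df["box_office"].apply(parse_dollars)

# Convert data types; unparseable budgets become NaN instead of raising.
df["budget"] = pd.to_numeric(df["budget"], errors="coerce")
df["release"] = pd.to_datetime(df["release"])

# Split one text column into hour and minute parts with a regex.
parts = df["running_time"].str.extract(r"(?:(\d+)\s*h\s*)?(\d+)\s*min")
parts = parts.apply(pd.to_numeric, errors="coerce").fillna(0)
df["running_minutes"] = parts[0] * 60 + parts[1]
```

Using `errors="coerce"` keeps the conversion from failing on a single bad value, at the cost of turning that value into a missing one that has to be handled later.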

Used throughout the transformation process:
- list comprehensions to filter data
- functions to perform the cleaning steps
- lambda functions
- regular expressions (regex)
- text parsing
- histograms
- scatter plots
- a database engine to connect pandas to a SQL database
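
A few of the techniques listed above, shown together in a minimal sketch that assumes pandas. The records, the `movies` table name, and the field names are made up. To stay runnable without a database server, the sketch swaps in a standard-library `sqlite3` connection where this project created a database engine.

```python
import sqlite3

import pandas as pd

# Made-up raw records standing in for the scraped Wikipedia entries.
raw_movies = [
    {"title": "Movie A", "year": "1999"},
    {"title": "Movie B"},  # no year, filtered out below
    {"title": "Movie C", "year": "2004"},
]

# List comprehension: keep only the records that have the field we need.
kept = [movie for movie in raw_movies if "year" in movie]

movies_df = pd.DataFrame(kept)
# Lambda: a small inline transformation applied to every value in a column.
movies_df["year"] = movies_df["year"].apply(lambda y: int(y))

# sqlite3 stands in for the SQL database so the sketch needs no server;
# a database engine object would be passed as `con` in the same way.
conn = sqlite3.connect(":memory:")
movies_df.to_sql(name="movies", con=conn, if_exists="replace", index=False)
round_trip = pd.read_sql("SELECT title, year FROM movies ORDER BY year", conn)
conn.close()
```

Reading the table back with `pd.read_sql` is a quick check that the load step wrote the rows and types that were expected.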

Images generated to compare similar values between the data sets:
- wiki_box_office vs kaggle_revenue
- wiki_running_time vs kaggle_runtime

![wiki_running_time vs kaggle_runtime](https://github.com/Al-Huneidi/Movies_ETL/blob/master/Screenshots/wiki_running_time%7Ckagggle_runtime.png)