be-ns / simpsons_analysis

Predicting IMDB ratings from the scripts alone; A ratio-based approach to Recommenders

Home Page: https://thesimpsonian.club


The Simpsons Analysis

Predicting IMDB ratings from the scripts of one of America's longest running shows.

Building a ratio-based Recommender for finding an episode you'd like.


TABLE OF CONTENTS

  1. Overview
  2. Technical Approach
  3. Next Steps

OVERVIEW

GOAL

The purpose of this project is twofold.

  1. I engineered a model that uses the scripts of an animated show to accurately predict the public response to the show.
  2. I built a simple episode recommender driven by a ratio of preferred characters and locations, as well as whether the user enjoys music and politics in their animated shows.

Ultimately, a model like this could save the creators of television shows hundreds of thousands of dollars a year. A single episode of the Simpsons costs between $400,000 and $2 million across animation, voice acting, sound, and final production. If the writers used a tool like this IMDB predictor, they could catch potentially low ratings before going into production and rework the episode.

PROCESS

Stacked IMDB Prediction Model

I engineered features from the 600 Simpsons scripts spanning 1989 to 2016. Analysis was done on the episode script text to extract key episode data such as word counts, character and location information, and the ratio of lines spoken by the core character group. Using this feature matrix, coupled with the IMDB ratings, I built a stacked model of two boosted decision trees (AdaBoost and Gradient Boosting) optimized with a random parameter search. This model ran overnight on an EC2 instance, recursively pickling the model whenever a better hold-out error was achieved. This, combined with the random search, was cross-validated to ensure the model was not resting at a local minimum.
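A minimal sketch of the random-search optimization described above, using scikit-learn's `RandomizedSearchCV` over an AdaBoost regressor. The data here is synthetic stand-in data, and the parameter grid is an assumption; only the overall pattern follows the text.

```python
# Sketch: random parameter search over a boosted tree regressor,
# scored on a 20% hold-out set. Toy data stands in for the real
# engineered feature matrix and IMDB ratings.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                        # stand-in feature matrix
y = 7 + X[:, 0] + rng.randn(200) * 0.3      # stand-in IMDB ratings

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

search = RandomizedSearchCV(
    AdaBoostRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "learning_rate": [0.01, 0.1, 0.5, 1.0]},
    n_iter=5, cv=3,
    scoring="neg_root_mean_squared_error", random_state=0,
)
search.fit(X_train, y_train)

# Scorer is negated RMSE (bigger is better), so flip the sign back.
holdout_rmse = -search.score(X_hold, y_hold)
```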

Quick Visualization of How Gradient Boosting works

The error is re-weighted at each iteration, and a step is taken in the negative direction of the gradient until the loss function is as low as it can be.
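The loop above can be sketched from scratch: with squared-error loss, the negative gradient is just the residual, so each new tree is fit to what the ensemble still gets wrong. This is an illustrative toy, not the project's model.

```python
# Hand-rolled gradient boosting on toy data: fit each shallow tree
# to the current residuals (the negative gradient of squared loss),
# then take a small step in that direction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.full_like(y, y.mean())   # start from the mean prediction
lr = 0.1                           # learning rate (step size)
for _ in range(100):
    residual = y - pred                           # negative gradient
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)                  # step downhill

rmse = np.sqrt(np.mean((y - pred) ** 2))          # shrinks each round
```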
Recommender / Web App

The recommender I built is not a true recommender in the sense of using distance metrics to compare similarity. This was chosen because the only data available was the episode scripts: there were no user ratings, and since most people would not know an episode well enough to ask for similar ones, I approached it with a 'cold-start' mindset. The user inputs preferences such as a favorite location or whether they like songs in their animated shows. The recommender then filters and sorts every episode by the selected preferences and returns the highest-rated episode that meets the criteria. Since this process took anywhere from 20-40 seconds, I built a hash table using Python dictionaries and stored the results for every possible combination of preferences. The web app shows a thumbnail of the suggested episode, the predicted and actual ratings for the episode (using the stacked model from above), and a button to click and watch the episode. screenshot of web app
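The precomputed lookup described above might look like the following sketch. The episode records and preference fields (character, location, has_song, is_political) are hypothetical stand-ins for the real script-derived data.

```python
# Precompute every preference combination -> best-rated matching episode,
# so a web request becomes a constant-time dict lookup instead of a
# 20-40 second scan over all episodes.
from itertools import product

episodes = [
    {"title": "A", "rating": 8.1, "character": "Homer", "location": "Moe's",
     "has_song": True, "is_political": False},
    {"title": "B", "rating": 9.0, "character": "Lisa", "location": "School",
     "has_song": False, "is_political": True},
    {"title": "C", "rating": 7.4, "character": "Homer", "location": "Moe's",
     "has_song": False, "is_political": False},
]

characters = sorted({e["character"] for e in episodes})
locations = sorted({e["location"] for e in episodes})

lookup = {}
for char, loc, song, pol in product(characters, locations,
                                    (True, False), (True, False)):
    matches = [e for e in episodes
               if e["character"] == char and e["location"] == loc
               and e["has_song"] == song and e["is_political"] == pol]
    if matches:
        # Keep only the highest-rated episode for this combination.
        lookup[(char, loc, song, pol)] = max(matches,
                                             key=lambda e: e["rating"])

best = lookup[("Homer", "Moe's", False, False)]
print(best["title"])  # -> C
```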

RESULTS

The stacked model, using engineered features, random-search hyperparameter optimization, and recursive attempts to outscore itself, achieved an RMSE of 0.351 on a 1-10 scale (RMSE is in the same units as the target).

image

The Recommender is hosted on Amazon Web Services and can be tried out here


TECHNICAL APPROACH

DATA

The original dataset was in CSV format, delineated so that each spoken line, whether 1 or 140 words long, had its own row. It was found on Data World in four relational CSV files (grouped by location_id, character_id, or episode_id). Features included episode ID, season, and number in series (season 4 episode 3 and season 12 episode 3 would both have a number in series of 3). A second dataset held episode-specific information like original air date, IMDB rating, and viewership upon initial airing.

MUNGING / CLEANING

Initial data exploration was done in Python with Pandas and Matplotlib in a Jupyter Notebook. Data was cleaned in Python using a variety of packages, most heavily Pandas and NumPy. 37 rows of script lines needed manual cleaning due to double-quotation errors in the text and raw_text sections. NaNs were imputed with the column mean; given more time, I would like to improve this to my preferred method of K-Nearest Neighbors NaN imputation. Alternately, a backfill method could be useful for time-series data. The spoken lines (raw_text) were the only aspect of the script information used; I did not use screen direction or animation notes.
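The column-mean imputation described above, shown on a toy frame (the column names are illustrative, not the real schema):

```python
# Column-mean NaN imputation with pandas. The KNN alternative
# (sklearn.impute.KNNImputer) and backfill (df.bfill()) are the
# improvements mentioned in the text.
import numpy as np
import pandas as pd

df = pd.DataFrame({"word_count": [120.0, np.nan, 98.0, np.nan],
                   "line_count": [40.0, 35.0, np.nan, 50.0]})

df_filled = df.fillna(df.mean())          # each NaN -> its column's mean

print(df_filled["word_count"].tolist())   # -> [120.0, 109.0, 98.0, 109.0]
```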

distribution of scores

The data was split into a train and hold-out set (80% / 20%). Models were built on the training set, with the final model selected by best hold-out RMSE. The algorithm ran in a while loop on an EC2 instance overnight, overwriting the saved model whenever the score improved. The overnight score improved from 0.42 to 0.351; the model for the latter was saved (i.e. pickled). Persisting the model file allows me to skip the training step in the future.
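The "keep the best model" loop might be sketched as follows. The file path, the loop bound (a finite stand-in for the overnight while loop), and the hyperparameters are all assumptions.

```python
# Sketch: repeatedly refit, score on the hold-out set, and pickle the
# model only when the hold-out RMSE improves.
import pickle
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(300, 4)                       # stand-in feature matrix
y = 7 + 2 * X[:, 0] + rng.randn(300) * 0.2

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)   # 80% / 20% split

best_rmse = np.inf
for seed in range(10):                     # stand-in for the overnight loop
    model = GradientBoostingRegressor(random_state=seed, subsample=0.8)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_hold, model.predict(X_hold)))
    if rmse < best_rmse:                   # only persist improvements
        best_rmse = rmse
        with open("best_model.pkl", "wb") as f:
            pickle.dump(model, f)
```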

MODEL SELECTION / BENCHMARKING

Benchmarking was done with minor hyperparameter optimization for several algorithms. All models were trained on the training set (80% of the original data), and the scores below are 3-fold cross-validated training and test errors.
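A benchmarking pass like the one above might look like this sketch: 3-fold cross-validated train/test RMSE for several regressor families on the same data. The model list and toy data are assumptions standing in for the real candidates and feature matrix.

```python
# 3-fold CV benchmark: train and test RMSE per algorithm.
import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.rand(240, 5)
y = 7 + np.sin(3 * X[:, 0]) + rng.randn(240) * 0.2

results = {}
for name, model in [("linear", LinearRegression()),
                    ("knn", KNeighborsRegressor()),
                    ("random forest", RandomForestRegressor(random_state=0)),
                    ("adaboost", AdaBoostRegressor(random_state=0)),
                    ("gradient boosting", GradientBoostingRegressor(random_state=0))]:
    cv = cross_validate(model, X, y, cv=3,
                        scoring="neg_root_mean_squared_error",
                        return_train_score=True)
    # Flip the sign: sklearn reports negated RMSE so bigger is better.
    results[name] = {"train": -cv["train_score"].mean(),
                     "test": -cv["test_score"].mean()}

for name, r in results.items():
    print(f"{name:18s} train {r['train']:.3f}  test {r['test']:.3f}")
```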

Benchmarking Train/Test error for regression algorithms.

models

For this graphic the pickled stacked model was used, so the RMSE shown (below 0.3) overstates performance relative to the hold-out score from initial training, as this graph has minor data leakage.

The model was selected for its handling of non-linear feature-to-target connections. Flexibility was highly valued, as was a reduction in compute. Although K-Nearest Neighbors was the quickest model, its test error did not improve to the extent that the stacked model's did.

MODELING / ALGORITHMS

Stacking (also known as meta-ensembling) is an ensemble technique that combines information from multiple predictive models to generate a new, better model. Often the stacked model (or 2nd-level model) will outperform each of the individual models due to its smoothing nature (reducing variance from overfitting) and its ability to weight each base model where it performs best and discount it where it performs poorly.

For my stacked model I chose to use an AdaBoost Decision Tree Regressor for the initial model, and a Gradient Boosted Decision Tree Regressor for the 2nd-level model.
Both of these are sequentially built decision-tree ensembles that aim to minimize a loss function by weighting incorrect predictions and then rebuilding the tree with emphasis on those cases. Gradient Boosting does so by taking a step in the negative direction of the gradient (a fancy way of saying it drives the error toward zero by rebuilding itself while emphasizing certain data points). Stacking the two models resulted in an RMSE of 0.351, meaning my predicted IMDB ratings were off by 0.351 on average on a 1-10 scale.
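One plausible wiring of the two-level stack described above (the exact scheme is my assumption, not confirmed by the source): the AdaBoost predictions are appended as an extra feature for the second-level Gradient Boosting model.

```python
# Stack sketch: AdaBoost as the base model, Gradient Boosting as the
# 2nd-level model trained on features + base predictions. Toy data.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(400, 5)
y = 7 + np.sin(4 * X[:, 0]) + X[:, 1] + rng.randn(400) * 0.2

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)

base = AdaBoostRegressor(random_state=0).fit(X_train, y_train)

# Augment the feature matrix with the base model's predictions.
X_train_meta = np.column_stack([X_train, base.predict(X_train)])
X_hold_meta = np.column_stack([X_hold, base.predict(X_hold)])

meta = GradientBoostingRegressor(random_state=0).fit(X_train_meta, y_train)
rmse = np.sqrt(mean_squared_error(y_hold, meta.predict(X_hold_meta)))
```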

ERROR METRIC CHOICE

I chose to evaluate my model using Root Mean Squared Error (RMSE). RMSE is calculated by averaging the squared errors, then taking the square root. The result is a positive average error in the original units of the target (in this case, how far off my 1-10 scale prediction is).

where N = number of predictions made, Ŷ = predicted score, and Y = true score.
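The formula in that notation (reconstructed here, since the original equation image did not survive) is the standard RMSE definition:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{Y}_i - Y_i\right)^2}
```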

MODEL - FEATURES

The final stacked model used engineered features to capture the signal of Simpsons IMDB ratings over time. The top features in the model were line length (number of words in the average statement), the longest line in the episode, and the Simpson-to-Other ratio. This ratio was found by analyzing the percent of lines in the episode spoken by the core group of characters: a larger Simpson-to-Other ratio implies that a larger percentage of the episode was spoken by the Simpson family. Another feature of note was the Political Cycle boolean, derived from the original air date, indicating whether the episode aired during a political cycle. The Simpsons is often explicitly political in nature, which I thought may or may not influence the public response to the episode. Note the non-linear signal of the data, which lent itself nicely to the tree-based models chosen. (feature plots: feature one, feature two, feature three, top features)
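The Simpson-to-Other ratio can be computed per episode roughly as follows; the column names are guesses at the Data World schema, not confirmed from the source.

```python
# Fraction of each episode's lines spoken by the core family.
import pandas as pd

core = {"Homer Simpson", "Marge Simpson", "Bart Simpson", "Lisa Simpson"}

lines = pd.DataFrame({
    "episode_id": [1, 1, 1, 1, 2, 2],
    "character":  ["Homer Simpson", "Moe Szyslak", "Lisa Simpson",
                   "Homer Simpson", "Ned Flanders", "Bart Simpson"],
})

# mean() of a boolean column == fraction of True rows per group.
ratio = (lines.assign(is_core=lines["character"].isin(core))
              .groupby("episode_id")["is_core"].mean())

print(ratio.to_dict())  # -> {1: 0.75, 2: 0.5}
```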

MODEL - RATIONALE

Why not NLP?
  1. At this stage the data isn't consumer-facing - it is a raw script - so engineered features are easier to interpret than TF-IDF with PCA and more flexible than LDA. Engineered features better serve the purpose here, since I could pull the top features from the stacked model.
Why not parallelize?
  1. The algorithms I used are built sequentially, each iteration requiring knowledge of the previous one, so parallelizing training is impossible with these models.
  2. The models were small enough to fit in memory, allowing lower latency than spinning up clusters in the cloud.

NEXT STEPS

  1. Impute NaNs with KNN instead of column-mean.
  2. Experiment with Polynomial Expansion for feature space to see if this would improve accuracy.
  3. Alter the political cycle metric to include counts of political language.
  4. Give actionable insights from the model for altering scripts to increase the public rating.
  5. Allow model to return top-N episodes instead of only the top episode.
  6. Run videos natively on Flask App.
  7. Pull Wikipedia description for every episode to be displayed on page.
  8. Compare stacked model with PyMC2 model.
  9. Forecast IMDB ratings using Facebook Prophet.
  10. Compare my recommended episodes with the recommended episodes found on Simpsons World.
  11. Add a 404 error page for when the combination of recommender preferences does not match any episode.
  12. Build out a Bokeh or Plotly interactive graph with episode information overlaid on hover.


Special thanks to Todd Schneider for his Simpsons by the Data analysis, which inspired this project.
