Streaming Viewing Analysis

Introduction

In this notebook, I take a look at my streaming viewing history data from my Hulu, Netflix, and Prime Video accounts. I merge these datasets with an additional dataset that contains more information about the streaming titles, such as run time, genres, and IMDb score. This gives me more insight into my streaming history.

Changes to the Project Plan

I turned in my Project Plan when I only had my Netflix data. Since then, I've been able to obtain my Hulu and Prime Video data.

My original idea was to compare my streaming data to the top streaming titles, but I wasn't able to find a dataset that worked with my data.

Project Challenges

I realized (too late for this project) that Netflix and Prime Video create a new "watch event" every time a stream is paused and played again. This means that my data shows individual titles with artificially high watch counts. I decided not to drop the duplicates automatically, because doing so would also have removed genuine repeat viewings; I would rather go through the datasets myself to determine which rows to delete.
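As an illustration only (this is not something done in the project), the manual review described above could be narrowed down by flagging events that follow an earlier event for the same title within a short window. The "Title" column name, the sample rows, and the 30-minute threshold are all assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical viewing log; "Title" is an assumed column name.
events = pd.DataFrame({
    "Title": ["Show A", "Show A", "Show A", "Show B"],
    "Date Watched": pd.to_datetime([
        "2022-01-03 20:00",
        "2022-01-03 20:20",
        "2022-01-05 19:00",
        "2022-01-03 21:00",
    ]),
})

# Assumed threshold: events closer together than this may be a pause/resume.
GAP = pd.Timedelta(minutes=30)

events = events.sort_values(["Title", "Date Watched"]).reset_index(drop=True)

# Time since the previous event for the same title (NaT for the first one).
since_prev = events.groupby("Title")["Date Watched"].diff()

# Flag candidates for manual review; nothing is deleted here.
events["Possible Resume"] = since_prev.notna() & (since_prev <= GAP)
```

Rows flagged True are only candidates: a genuine rewatch shortly after the first viewing would be flagged too, which is why a manual check is still needed.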

In order to combine the three streaming viewing histories, I removed more data from them than I would have liked. Hulu had the best data because it did not count each pause as a new "watch event", but that data came split over multiple PDFs and would have required far more time to clean. In the interest of time, I decided to stick with the basic data from each streaming service.

I struggled to find an additional dataset that worked with my combined streaming viewing history data. In the end, I stuck with the one that gave me the largest merged dataset. I know I could have used just the three streaming datasets, but I wanted the challenge of merging additional data.

Datasets Used

Personal Streaming Viewing History

Hulu
Netflix
Prime Video

Additional dataset

Kaggle - Netflix, Movies, and Popularity

How to Run this Project

Using a Virtual Environment

Clone this repo: git clone https://github.com/istarlet/streaming_analysis.git
Create a new folder in the cloned repo called datasets
Download the datasets here and add the downloaded datasets to the datasets folder (see note below)
cd into the cloned project folder
Install virtualenv if you don't already have it installed: pip install virtualenv
Activate the virtual environment (see instructions here)
Install the requirements.txt file: pip install -r requirements.txt
Then run these Jupyter Notebook files: Hulu, Imdb, Netflix, Prime Video BEFORE running the main project notebook, Streaming Data

(Note: If you downloaded the dataset as a .zip file, make sure to add the individual datasets to the new datasets folder and not the folder they were zipped in.)

Activating a Virtual Environment

On Mac/Linux

Open the Terminal and create a virtual environment with the command python3 -m venv virtual-env

Activate the virtual environment with the command source virtual-env/bin/activate

On Windows

Open the Command Prompt and create a virtual environment with the command python -m venv virtual-env

Activate the virtual environment with the command virtual-env\Scripts\activate.bat

Deactivate the Virtual Environment

Type deactivate

Python packages used in this project:

datetime
matplotlib
pandas
seaborn

Project Requirements

1. Loading data - Feature 1

Read TWO data files (JSON, CSV, Excel, etc.).

I read in four CSV files.

AshleyViewingActivity.csv
DigitalPrimeVideoViewingHistory.csv
HuluViewingHistoryUpdated.csv
titles.csv
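A minimal sketch of loading the four files with pandas, assuming they sit in the datasets folder described in the run instructions. The short labels, and which file belongs to which service, are assumptions made for illustration; the load_datasets helper is hypothetical and not taken from the notebooks.

```python
from pathlib import Path

import pandas as pd

# Assumed mapping of short labels to the CSV files listed above.
DATASET_FILES = {
    "netflix": "AshleyViewingActivity.csv",
    "prime_video": "DigitalPrimeVideoViewingHistory.csv",
    "hulu": "HuluViewingHistoryUpdated.csv",
    "titles": "titles.csv",
}


def load_datasets(folder="datasets"):
    """Read each CSV into a DataFrame, keyed by its short label."""
    frames = {}
    for label, filename in DATASET_FILES.items():
        path = Path(folder) / filename
        if not path.exists():
            # Fail early with a clear message if a file was not downloaded.
            raise FileNotFoundError(f"Missing dataset: {path}")
        frames[label] = pd.read_csv(path)
    return frames
```

Calling load_datasets() from the project root would then return a dictionary of four DataFrames, or raise an error naming the first file that is missing.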

2. Clean and operate on the data while combining them - Feature 2

Clean your data and perform a pandas merge with your two data sets, then calculate some new values based on the new data set.

I cleaned each dataset in its own Jupyter Notebook.

Hulu
Imdb
Netflix
Prime Video

I then concatenated the Hulu, Netflix, and Prime Video datasets together.

I then merged the new combined streaming dataset with the IMDb dataset.
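The concatenate-then-merge step can be sketched roughly as follows. The tiny stand-in frames, the "Title" and "Service" column names, the metadata columns, and the choice of an inner merge are all assumptions for illustration; the notebooks may use different keys and options.

```python
import pandas as pd

# Stand-in viewing histories; "Title" is an assumed shared column name.
hulu = pd.DataFrame({"Title": ["Show A"], "Date Watched": ["2022-01-03 20:15"]})
netflix = pd.DataFrame({
    "Title": ["Show B", "Show A"],
    "Date Watched": ["2022-02-10 21:00", "2022-02-11 08:30"],
})
prime = pd.DataFrame({"Title": ["Film C"], "Date Watched": ["2022-03-05 19:45"]})

# Stack the three histories, tagging each row with its source service.
streaming = pd.concat(
    [
        hulu.assign(Service="Hulu"),
        netflix.assign(Service="Netflix"),
        prime.assign(Service="Prime Video"),
    ],
    ignore_index=True,
)

# Hypothetical title metadata standing in for titles.csv.
titles = pd.DataFrame({
    "Title": ["Show A", "Show B"],
    "runtime": [45, 30],
    "imdb_score": [8.1, 7.4],
})

# An inner merge keeps only viewing rows whose title appears in the metadata.
merged = streaming.merge(titles, on="Title", how="inner")
```

With an inner merge, any title missing from the metadata (Film C here) drops out, so the size of the merged result depends on how well the two sets of titles overlap.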

I added new columns to the merged dataset by extracting the day, month, and hour from the "Date Watched" column.
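A minimal sketch of deriving those columns with the pandas datetime accessor. The new column names ("Day", "Month", "Hour"), the use of day and month names rather than numbers, and the sample timestamps are assumptions.

```python
import pandas as pd

# Stand-in data using the "Date Watched" column mentioned above.
df = pd.DataFrame({"Date Watched": ["2022-01-03 20:15:00", "2022-07-16 08:30:00"]})

# Parse the strings into datetimes so the .dt accessor is available.
df["Date Watched"] = pd.to_datetime(df["Date Watched"])

# Derive the new columns (names are assumed for this sketch).
df["Day"] = df["Date Watched"].dt.day_name()
df["Month"] = df["Date Watched"].dt.month_name()
df["Hour"] = df["Date Watched"].dt.hour
```

Columns like these make it straightforward to group viewing events by weekday, month, or time of day.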

3. Visualize / Present your data - Feature 3

Make 3 matplotlib or seaborn (or another plotting library) visualizations to display your data.

4. Best practices - Feature 4

Utilize a virtual environment and include instructions in your README on how the user should set one up

I created a virtual environment with instructions in the How to Run this Project section.

5. Interpretation of your data - Feature 5

Annotate your code with markdown cells in Jupyter Notebook, write clear code comments, and have a well-written README.md.

In my Jupyter Notebooks, I annotated my code with markdown cells and wrote clear code comments. I have included a README.md.
