evpirnia / scrapy-movies

Scrapy-IMDB

Benchmarking a Big Data System

Dataset Source: https://grouplens.org/datasets/movielens/20m/

Project Details: https://lipyeow.github.io/ics421s17/morea/project/experience-project.html

Only the following files from the dataset above are used:

movies.csv
ratings.csv
tags.csv

List of Queries:

1. Select all distinct movie titles where "Zach Galifianakis" appears as a tag
2. Select all movie titles, with their average rating, that have the tag "visually appealing"
3. Count the number of records where the rating is greater than or equal to 4.5 and the movie ID is greater than or equal to 96000
4. Select all distinct movie titles where the genre contains the substring "Act"
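
The four queries above can be sketched in SQL as follows. This is a minimal illustration, not the SQL that was actually run on Hive or Spark: it uses Python's built-in sqlite3 module as a stand-in engine, the table and column names follow the MovieLens 20M CSV headers, and the movie titles and rows are hypothetical toy data. Query 2 is written under the assumption that the average is taken over all ratings of any movie carrying the tag, and query 3 under the assumption that "records" means rows of ratings.csv.

```python
import sqlite3

# In-memory stand-in for the three CSV files (column names follow MovieLens 20M).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies  (movieId INTEGER, title TEXT, genres TEXT);
CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER);
CREATE TABLE tags    (userId INTEGER, movieId INTEGER, tag TEXT, timestamp INTEGER);
""")

# Hypothetical toy rows, only for illustrating the query shapes.
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Movie A", "Action|Comedy"),
    (2, "Movie B", "Drama"),
    (96001, "Movie C", "Action"),
])
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", [
    (10, 1, 4.0, 0),
    (11, 1, 5.0, 0),
    (10, 2, 3.0, 0),
    (10, 96001, 4.5, 0),
    (11, 96001, 2.0, 0),
])
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", [
    (10, 1, "Zach Galifianakis", 0),
    (11, 1, "Zach Galifianakis", 0),  # same tag twice: DISTINCT collapses it
    (10, 2, "visually appealing", 0),
])

QUERIES = {
    # 1. Distinct titles tagged "Zach Galifianakis"
    1: """SELECT DISTINCT m.title
          FROM movies m JOIN tags t ON m.movieId = t.movieId
          WHERE t.tag = 'Zach Galifianakis'""",
    # 2. Title and average rating of movies tagged "visually appealing"
    #    (the IN subquery avoids counting a rating once per matching tag row)
    2: """SELECT m.title, AVG(r.rating)
          FROM movies m JOIN ratings r ON m.movieId = r.movieId
          WHERE m.movieId IN (SELECT movieId FROM tags
                              WHERE tag = 'visually appealing')
          GROUP BY m.movieId, m.title""",
    # 3. Count of ratings with rating >= 4.5 and movieId >= 96000
    3: """SELECT COUNT(*) FROM ratings
          WHERE rating >= 4.5 AND movieId >= 96000""",
    # 4. Distinct titles whose genres string contains "Act"
    4: """SELECT DISTINCT title FROM movies
          WHERE genres LIKE '%Act%'""",
}

results = {n: conn.execute(sql).fetchall() for n, sql in QUERIES.items()}
for n, rows in results.items():
    print(f"Query {n}: {rows}")
```

Note that SQLite's LIKE is case-insensitive for ASCII characters by default, so query 4 may match more rows here than it would on an engine with case-sensitive pattern matching.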

Data Processing Systems to be Benchmarked:

Spark: http://spark.apache.org/
Hive: https://hive.apache.org/

Setting up and using clusters on Google Cloud Platform:

Results

Hive Output (corresponding to query number):

1. 18 rows selected (33.878 seconds)
2. 197 rows selected (53.175 seconds)
3. 1 row selected (30.086 seconds)
4. 1646 rows selected (33.01 seconds)

Spark Output (corresponding to query number):

1. Time taken: 19.296 seconds, Fetched 18 row(s)
2. Time taken: 28.803 seconds, Fetched 197 row(s)
3. Time taken: 17.308 seconds, Fetched 1 row(s)
4. Time taken: 22.904 seconds, Fetched 1646 row(s)
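
Both systems returned the same row counts for every query, so the timings can be compared directly. The short sketch below computes the Hive-to-Spark time ratio per query from the numbers reported above; the variable names are illustrative, and the ratios describe only these reported timings, not a general performance claim.

```python
# Wall-clock times in seconds, copied from the Hive and Spark output above
# (index 0 is query 1, index 3 is query 4).
hive_seconds = [33.878, 53.175, 30.086, 33.01]
spark_seconds = [19.296, 28.803, 17.308, 22.904]

# Ratio > 1 means Spark finished that query faster than Hive.
speedups = [h / s for h, s in zip(hive_seconds, spark_seconds)]

for n, (h, s, x) in enumerate(zip(hive_seconds, spark_seconds, speedups), start=1):
    print(f"Query {n}: Hive {h:.3f}s, Spark {s:.3f}s, ratio {x:.2f}x")
```

On these figures Spark was faster on all four queries, by a factor between roughly 1.4 and 1.9.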

Hive CPU Utilization:

Hive Disk Utilization:

Spark CPU Utilization:

Spark Disk Utilization:
