Final exercise of the course "Data Analytics with Apache Spark" by Gaetano Fabiano
- Download the IMDB Spoiler Dataset from Kaggle: https://www.kaggle.com/rmisra/imdb-spoiler-dataset
- Download the IMDb Dataset from Kaggle: https://www.kaggle.com/ashirwadsangwan/imdb-dataset
- Move the downloaded files into the folder "in"
- Install Python
- Install PySpark
-
The script main.py writes to a txt file the most frequently used words for films whose rating is above the mean rating and that have more than 100 ratings.
-
The script main_part_two.py applies k-means clustering to the films based on their rating.