s-arora1987 / Comparison_PySpark_df_RDD_mllatest_dataset_GCP_cluster

Comparison of collaborative-filtering recommender training performance using PySpark DataFrames and PySpark RDDs on a GCP cluster

Motivation

PySpark offers two ways to store a data table: a pandas-like DataFrame structure (not identical to a pandas DataFrame) and the resilient distributed dataset (RDD). For simple low-level operations on tables, PySpark DataFrames are known to be faster than RDDs [1,2]. That alone says little about how the two compare in higher-level modeling tasks, in terms of the overall training error and training time of a predictive model (e.g. a recommender system). Do DataFrames enable faster training and better learning accuracy than RDDs? And are they also faster at making predictions after training?

Libraries Used

I used the following libraries to build this project:

  • pandas
  • pyspark 2.3.2
  • time
  • math
  • matplotlib

Files

The notebook 'Comparison_PySpark_df_RDD_mllatest_dataset_GCP_cluster.ipynb' implements the code that compares the training performance of a DataFrame-based collaborative-filtering recommender model with that of an RDD-based model.
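For orientation, here is a minimal sketch of what training the same ALS model through each API might look like. It is not taken from the notebook: the function names, the default hyperparameters, and the assumption that the ratings arrive as MovieLens-style `userId,movieId,rating,timestamp` CSV lines are all illustrative.

```python
def parse_rating_line(line):
    """Parse one 'userId,movieId,rating,timestamp' CSV line (assumed layout)
    into a (user, item, rating) triple."""
    user, item, rating = line.split(",")[:3]
    return int(user), int(item), float(rating)


def train_als_dataframe(spark, triples, rank=10, iterations=10, reg=0.1):
    """Train ALS through the DataFrame API (pyspark.ml).

    `spark` is a SparkSession; `triples` is a list of (user, item, rating).
    The hyperparameter defaults are placeholders, not the notebook's values.
    """
    # Imported inside the function so this sketch loads without Spark installed.
    from pyspark.ml.recommendation import ALS

    df = spark.createDataFrame(triples, ["userId", "movieId", "rating"])
    als = ALS(
        userCol="userId",
        itemCol="movieId",
        ratingCol="rating",
        rank=rank,
        maxIter=iterations,
        regParam=reg,
        coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    )
    return als.fit(df)


def train_als_rdd(sc, triples, rank=10, iterations=10, reg=0.1):
    """Train ALS through the RDD API (pyspark.mllib).

    `sc` is a SparkContext; `triples` is a list of (user, item, rating).
    """
    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize(triples).map(lambda t: Rating(t[0], t[1], t[2]))
    return ALS.train(ratings, rank, iterations=iterations, lambda_=reg)
```

With a running `SparkSession` named `spark`, the two paths would be called as `train_als_dataframe(spark, triples)` and `train_als_rdd(spark.sparkContext, triples)`, so both models see the same data and hyperparameters.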

Summary of Results

The results showed that both training and prediction of the recommender system were considerably faster with DataFrames than with RDD-based ALS, without any compromise in learning accuracy.
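The two quantities compared here, RMSE and wall-clock time, can be measured with the `math` and `time` modules listed above. The helpers below are a hedged sketch of one way to do that; the names `rmse` and `timed` are hypothetical and not taken from the notebook.

```python
import math
import time


def rmse(pairs):
    """Root-mean-square error over an iterable of (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse needs at least one (actual, predicted) pair")
    squared_error = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared_error / len(pairs))


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Because Spark evaluates transformations lazily, a timed prediction step should end in an action such as `count()` or `collect()`; otherwise the measurement covers only the construction of the execution plan.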

Figure: RMSE

Figure: Training time

For further details on the implementation and results, please see the Medium blog post "Efficiency comparison between collaborative filtering using PySpark data frames and that using PySpark RDD".

Contact

sa08751@uga.edu

Acknowledgements
