s-arora1987/Comparison_PySpark_df_RDD_mllatest_dataset_GCP_cluster

Comparison of performance for collaborative filtering recommender training using PySpark DataFrames and PySpark RDDs on a GCP cluster

Motivation
Libraries Used
Files
Summary of Results
Contact
Acknowledgements

Motivation

PySpark offers two options for storing a data table: a DataFrame, which resembles a pandas DataFrame but is not identical to it, and a resilient distributed dataset (RDD). For simple low-level operations on tables, PySpark DataFrames are known to be faster than RDDs [1,2]. That alone says little about how the two compare in higher-level modeling tasks, in terms of the overall training error and training time of a predictive model (e.g. a recommender system). Do DataFrames enable faster training and better learning accuracy than RDDs? And are they also faster at making predictions after training?

Libraries Used

I used the following libraries to build this project:

pandas
pyspark 2.3.2
time
math
matplotlib

Files

The notebook 'Comparison_PySpark_df_RDD_mllatest_dataset_GCP_cluster.ipynb' implements the code that compares the training performance of a DataFrame-based collaborative-filtering recommender model with that of an RDD-based model.
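As a rough illustration of this kind of comparison (a minimal sketch, not the notebook's actual code), the snippet below times ALS training through the DataFrame API (pyspark.ml) and through the RDD API (pyspark.mllib). The helper names `timed`, `train_df_als` and `train_rdd_als`, the column names `userId`, `movieId` and `rating`, and the hyperparameter values are assumptions made for the example.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def train_df_als(ratings_df, rank=10, iterations=10, reg=0.1):
    """Fit ALS on a Spark DataFrame (hypothetical helper; column names assumed)."""
    # Imported lazily so the timing helper above works without pyspark installed.
    from pyspark.ml.recommendation import ALS
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=rank, maxIter=iterations, regParam=reg)
    return als.fit(ratings_df)


def train_rdd_als(ratings_rdd, rank=10, iterations=10, reg=0.1):
    """Fit ALS on an RDD of (user, item, rating) records (hypothetical helper)."""
    from pyspark.mllib.recommendation import ALS
    return ALS.train(ratings_rdd, rank, iterations=iterations, lambda_=reg)


# Assumed usage, given an existing ratings_df and ratings_rdd:
# model_df, seconds_df = timed(train_df_als, ratings_df)
# model_rdd, seconds_rdd = timed(train_rdd_als, ratings_rdd)
```

When timing predictions rather than training, keep in mind that Spark transformations are lazy, so the timed function should end with an action such as `count()`; otherwise only the construction of the query plan is measured.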

Summary of Results

The results showed that both training and prediction for the recommender system were considerably faster with DataFrames than with ALS on RDDs, with no loss in learning accuracy.
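One common way to compare learning accuracy between two recommenders is the root-mean-square error (RMSE) over (actual, predicted) rating pairs. This README does not state which metric the notebook uses, so the sketch below is an assumption, and `rmse` is a hypothetical helper name.

```python
import math


def rmse(pairs):
    """Root-mean-square error over an iterable of (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("rmse() needs at least one (actual, predicted) pair")
    squared_error = sum((actual - predicted) ** 2 for actual, predicted in pairs)
    return math.sqrt(squared_error / len(pairs))
```

Computing the same metric on the same held-out ratings for both models keeps the accuracy comparison independent of which API produced the predictions.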

For further details of the implementation and results, please see the Medium blog post "Efficiency comparison between collaborative filtering using PySpark data frames and that using PySpark RDD".

Contact

sa08751@uga.edu
