MahsaShk / ApacheSpark

Apache Spark machine learning project using pyspark

ApacheSpark

This repository introduces PySpark by example and provides solutions to several machine learning consulting projects. A Spark Streaming project is presented at the end.

NB: Spark version 3.0.0 is used in this repository.

List of PySpark materials:

Introduction to PySpark RDD and DataFrame

Details of PySpark DataFrame

How to set up PySpark on Amazon AWS EC2

Introduction to PySpark MLlib (machine learning library)

Joining DataFrames in PySpark

Machine learning projects using the PySpark ML library:

In this project, hyperparameters are tuned using CrossValidator, and categorical features are handled.
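As an illustration of how categorical handling and CrossValidator can fit together, here is a minimal sketch rather than the notebook's actual code. It assumes a binary label and uses LogisticRegression as the estimator; the function names, column suffixes, and grid values are all hypothetical.

```python
def feature_columns(categorical_cols, numeric_cols):
    """Derive the intermediate column names used by the pipeline stages."""
    indexed = [c + "_idx" for c in categorical_cols]   # StringIndexer outputs
    encoded = [c + "_vec" for c in categorical_cols]   # OneHotEncoder outputs
    return indexed, encoded, encoded + list(numeric_cols)


def build_cross_validator(categorical_cols, numeric_cols, label_col="label"):
    """Build a CrossValidator around an index -> one-hot -> assemble -> LR pipeline."""
    # PySpark is imported here so the naming helper above works without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    indexed, encoded, assembler_inputs = feature_columns(categorical_cols, numeric_cols)

    # Categorical handling: string -> category index -> one-hot vector.
    indexer = StringIndexer(inputCols=list(categorical_cols), outputCols=indexed,
                            handleInvalid="keep")
    encoder = OneHotEncoder(inputCols=indexed, outputCols=encoded)
    assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol=label_col)
    pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])

    # Hypothetical grid: 3 x 2 = 6 candidate settings, each scored with 3-fold CV.
    grid = (ParamGridBuilder()
            .addGrid(lr.regParam, [0.01, 0.1, 1.0])
            .addGrid(lr.elasticNetParam, [0.0, 0.5])
            .build())
    evaluator = BinaryClassificationEvaluator(labelCol=label_col)
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=3)
```

Calling `build_cross_validator(["city", "plan"], ["age"]).fit(train_df)` on a DataFrame with those (hypothetical) columns would return a fitted model whose `bestModel` attribute holds the pipeline refit with the best-scoring parameters.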

In this project, the imbalanced-data issue is addressed using weightCol in LogisticRegression. A datetime feature is also processed, and StandardScaler is used to scale each feature to zero mean and unit standard deviation.
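The sketch below shows one common way to use weightCol, assuming inverse-frequency ("balanced") class weights. The notebook's actual weighting scheme is not stated here, so the formula, the function names, and the column names `features`, `label`, and `weight` are all assumptions. Note that Spark's StandardScaler only scales to unit standard deviation by default, so zero-mean centering requires `withMean=True`.

```python
def balanced_class_weights(class_counts):
    """Inverse-frequency weights: total / (n_classes * count) for each class."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}


def add_weight_column(df, weights, label_col="label", weight_col="weight"):
    """Attach a per-row weight looked up from the row's class label."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark

    expr = None
    for label, weight in weights.items():
        cond = F.col(label_col) == label
        # Chain WHEN clauses: when(label == a, w_a).when(label == b, w_b)...
        expr = F.when(cond, weight) if expr is None else expr.when(cond, weight)
    return df.withColumn(weight_col, expr)


def build_weighted_lr_pipeline(label_col="label", weight_col="weight"):
    """Scale an already-assembled 'features' vector, then fit a weighted LR."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler

    # withMean=True is needed for zero mean; it produces dense vectors.
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                            withMean=True, withStd=True)
    lr = LogisticRegression(featuresCol="scaled_features", labelCol=label_col,
                            weightCol=weight_col)
    return Pipeline(stages=[scaler, lr])
```

With 90 negative and 10 positive rows, for example, this scheme weights each positive row nine times as heavily as a negative one, so both classes contribute equally to the loss.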

This project focuses on computing feature importance. The imbalanced-data issue is handled using boosting techniques, which are often a reasonable choice for class-imbalanced data.

For better results, one can combine synthetic sampling methods such as SMOTE and MSMOTE with advanced boosting methods such as gradient boosting and XGBoost.
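To make the feature-importance step concrete, here is a minimal sketch that uses GBTClassifier as the boosting model. That choice is an assumption, since the text above does not name the estimator. The helper names are hypothetical, and the code assumes every feature is a plain numeric column so that vector positions line up with column names.

```python
def rank_importances(feature_names, importances):
    """Pair each feature name with its importance, largest first."""
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def gbt_feature_importances(train_df, feature_names, label_col="label"):
    """Fit a gradient-boosted tree classifier and rank its feature importances."""
    # Imported lazily so rank_importances can be used without PySpark.
    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(feature_names), outputCol="features")
    gbt = GBTClassifier(featuresCol="features", labelCol=label_col, maxIter=20)
    model = gbt.fit(assembler.transform(train_df))
    # featureImportances is a vector indexed by position in the assembled features.
    return rank_importances(feature_names, model.featureImportances.toArray())
```

If one-hot encoded columns are assembled as well, each encoded column occupies several vector slots, and the name-to-index mapping has to be rebuilt before ranking.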

This project provides movie recommendations on the MovieLens dataset using a collaborative filtering approach.
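Collaborative filtering in Spark ML is available through the ALS (alternating least squares) estimator. The sketch below assumes this project uses it, and that the ratings DataFrame has columns named `userId`, `movieId`, and `rating`. The hyperparameter values are illustrative only. The small `rmse` helper shows what the evaluator's metric computes.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (actual, predicted) rating pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((actual - pred) ** 2 for actual, pred in pairs) / len(pairs))


def train_als(ratings_df, seed=42):
    """Fit ALS on an 80/20 split and return the model with its test RMSE."""
    # Imported lazily so the rmse helper can be used without PySpark.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    train, test = ratings_df.randomSplit([0.8, 0.2], seed=seed)
    als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
              rank=10, maxIter=10, regParam=0.1,
              coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
              seed=seed)
    model = als.fit(train)
    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    return model, evaluator.evaluate(model.transform(test))
```

After fitting, `model.recommendForAllUsers(5)` returns the five highest-scoring movies for each user.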

In this project, an SMS spam detector is built using Spark's NLP tools.

An introduction to Spark's NLP tools, along with some examples, is presented here.

The pipeline design includes: RegexTokenizer, StopWordsRemover, TF-IDF based feature extraction, and a Naive Bayes classifier.
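The stages listed above can be sketched as a single Pipeline. This is a minimal sketch under assumptions: HashingTF is used for the term-frequency step (the project may use CountVectorizer instead), and the input column names `text` and `class` are hypothetical. The `spark_idf` helper reproduces the smoothed formula that Spark's IDF stage documents, log((m + 1) / (df + 1)), for m documents and document frequency df.

```python
import math


def spark_idf(num_docs, doc_freq):
    """Smoothed inverse document frequency, as documented for Spark's IDF stage."""
    return math.log((num_docs + 1) / (doc_freq + 1))


def build_spam_pipeline(text_col="text", label_col="class"):
    """Tokenize, drop stop words, compute TF-IDF, and classify with Naive Bayes."""
    # Imported lazily so spark_idf can be used without PySpark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import (IDF, HashingTF, RegexTokenizer,
                                    StopWordsRemover, StringIndexer)

    # Map string labels (e.g. "ham"/"spam") to numeric indices.
    label_indexer = StringIndexer(inputCol=label_col, outputCol="label")
    # Split on runs of non-word characters; output is lowercased by default.
    tokenizer = RegexTokenizer(inputCol=text_col, outputCol="tokens", pattern="\\W")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    tf = HashingTF(inputCol="filtered", outputCol="raw_features")
    idf = IDF(inputCol="raw_features", outputCol="features")
    # TF-IDF values are non-negative, which Naive Bayes requires.
    nb = NaiveBayes(featuresCol="features", labelCol="label")
    return Pipeline(stages=[label_indexer, tokenizer, remover, tf, idf, nb])
```

A term that appears in every document gets an IDF of zero under this formula, so it contributes nothing to the feature vector.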

Spark streaming:

This project creates an application that plots out the popularity of tags associated with incoming tweets streamed live from Twitter.
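The tag-counting core of such an application might look like the sketch below. It does not cover the Twitter connection or the plotting. It assumes that a separate process forwards tweet text, one tweet per line, to a local TCP socket. The host, port, and interval values are hypothetical, and results are printed rather than plotted.

```python
def extract_hashtags(text):
    """Return the lowercased hashtags found in one tweet's text."""
    return [word.lower() for word in text.split()
            if word.startswith("#") and len(word) > 1]


def run_tag_counter(host="localhost", port=5555, batch_seconds=10, window_seconds=60):
    """Count hashtags over a sliding window of text lines read from a socket."""
    # Imported lazily so extract_hashtags can be used without PySpark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TagPopularity")
    ssc = StreamingContext(sc, batch_seconds)
    lines = ssc.socketTextStream(host, port)

    counts = (lines.flatMap(extract_hashtags)
              .map(lambda tag: (tag, 1))
              # No inverse function is given, so each window is recomputed in full.
              .reduceByKeyAndWindow(lambda a, b: a + b, None,
                                    window_seconds, batch_seconds))
    # Sort each window's counts so the most popular tags come first.
    top = counts.transform(lambda rdd: rdd.sortBy(lambda kv: -kv[1]))
    top.pprint(10)

    ssc.start()
    ssc.awaitTermination()
```

The window length must be a multiple of the batch interval. A plotting front end could read the sorted counts from a table or file instead of relying on `pprint`.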

References:

[1] Apache Spark Documentation available at http://spark.apache.org/

[2] Kaggle open datasets available at https://www.kaggle.com/docs/datasets

[3] Spark and Python for Big Data with PySpark, Udemy

[4] Advanced Analytics with Spark, 2nd Edition, Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
