shah-zeb-naveed / distributed-computing-pyspark

Distributed Computing using PySpark

Distributed Computing - PySpark

This repository contains mini-projects on distributed computing using Spark in Python.

  1. Text Analytics: Point-wise Mutual Information in PySpark

Calculates the PMI for a token or a pair of tokens for all the words occurring in a text file.
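As a rough illustration of the PMI computation described above, here is a minimal sketch, not the notebook's actual code. It assumes that two tokens co-occur when they appear on the same line, that probabilities are fractions of lines, and that the natural logarithm is used. The names `pmi`, `line_tokens` and `pmi_rdd` are hypothetical.

```python
import math
from itertools import combinations
from operator import add


def pmi(pair_count, x_count, y_count, n_lines):
    """PMI(x, y) = log(P(x, y) / (P(x) * P(y))) with line-level probabilities."""
    return math.log((pair_count * n_lines) / (x_count * y_count))


def line_tokens(line):
    # Assumption: lowercase whitespace tokenization; each token counts once per line.
    return sorted(set(line.lower().split()))


def pmi_rdd(lines):
    """`lines` is an RDD of strings; returns an RDD of ((x, y), pmi) with x < y."""
    tokens = lines.map(line_tokens).cache()
    n_lines = tokens.count()
    # Assumption: the vocabulary is small enough to collect to the driver.
    word_counts = (tokens.flatMap(lambda ts: [(t, 1) for t in ts])
                         .reduceByKey(add)
                         .collectAsMap())
    pair_counts = (tokens.flatMap(lambda ts: [(p, 1) for p in combinations(ts, 2)])
                         .reduceByKey(add))
    return pair_counts.map(
        lambda kv: (kv[0], pmi(kv[1], word_counts[kv[0][0]],
                               word_counts[kv[0][1]], n_lines)))
```

With an existing `SparkContext` named `sc`, something like `pmi_rdd(sc.textFile("corpus.txt")).take(5)` would return a few scored pairs; the file name is a placeholder.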

  2. Graph/Network Analysis: Personalized PageRank Algorithm in PySpark

Implements a modified version of the PageRank algorithm in which nodes are ranked relative to a given source node. The modifications are two-fold: (A) random jumps go only to the source node, and (B) mass lost at dangling nodes is transferred entirely to the source node instead of being redistributed over the entire graph.
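The two modifications can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's implementation: the jump probability `alpha`, the fixed iteration count, the requirement that every node appears as a key of the adjacency RDD, and the names `update_mass` and `personalized_pagerank` are all hypothetical.

```python
from operator import add


def update_mass(node, received, dangling, source, alpha):
    """New mass of `node` after one iteration.

    (A) The random-jump mass `alpha` goes only to the source node.
    (B) The mass held by dangling nodes goes only to the source node.
    """
    if node == source:
        return alpha + (1.0 - alpha) * (received + dangling)
    return (1.0 - alpha) * received


def personalized_pagerank(adjacency, source, alpha=0.15, iterations=10):
    """`adjacency` is an RDD of (node, [neighbours]); returns an RDD of (node, mass).

    Assumption: every node, including dangling ones, appears as a key.
    """
    adjacency = adjacency.cache()
    # All mass starts at the source node.
    ranks = adjacency.map(lambda kv: (kv[0], 1.0 if kv[0] == source else 0.0))
    for _ in range(iterations):
        joined = adjacency.join(ranks).cache()  # (node, (neighbours, mass))
        # Total mass sitting on nodes without out-links.
        dangling = (joined.filter(lambda kv: not kv[1][0])
                          .map(lambda kv: kv[1][1])
                          .sum())
        # Each node splits its mass evenly among its out-links.
        received = (joined.flatMap(
            lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]])
            .reduceByKey(add))
        # Default argument binds this iteration's dangling mass to the lambda.
        ranks = adjacency.leftOuterJoin(received).map(
            lambda kv, d=dangling: (
                kv[0], update_mass(kv[0], kv[1][1] or 0.0, d, source, alpha)))
    return ranks
```

Because the jump mass and the dangling mass both return to the source, the total mass stays at 1 after every iteration. For long runs, the growing RDD lineage would likely need periodic caching or checkpointing.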

  3. Querying TPC-H with Spark DataFrames and Spark SQL

  4. Stochastic Gradient Descent using Spark from scratch for email classification into spam/ham.

  5. Analysis of live robot movement data using Spark Streaming.
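For the stochastic gradient descent project listed above, a from-scratch spam/ham classifier might look roughly like the sketch below. It is a guess at one workable design rather than the repository's method: it assumes logistic regression over binary token features, records shaped as `(label, [features])` with label 1 for spam and 0 for ham, and one model per partition averaged on the driver. All function names and the learning rate are hypothetical.

```python
import math
from collections import defaultdict


def sgd_step(weights, features, label, lr=0.002):
    """One in-place logistic-regression SGD update for a single email."""
    score = sum(weights.get(f, 0.0) for f in features)
    score = max(-30.0, min(30.0, score))  # clamp to keep exp() from overflowing
    prob = 1.0 / (1.0 + math.exp(-score))  # predicted probability of spam
    for f in features:
        weights[f] = weights.get(f, 0.0) + lr * (label - prob)
    return weights


def train_partition(records, lr=0.002):
    """Run sequential SGD over one partition, starting from zero weights."""
    weights = {}
    for label, features in records:
        sgd_step(weights, features, label, lr)
    yield weights


def train(records_rdd, lr=0.002):
    """Assumption: per-partition models are simply averaged on the driver."""
    models = records_rdd.mapPartitions(
        lambda part: train_partition(part, lr)).collect()
    averaged = defaultdict(float)
    for model in models:
        for f, w in model.items():
            averaged[f] += w / len(models)
    return dict(averaged)


def predict(weights, features):
    """Return 1 (spam) if the linear score is positive, else 0 (ham)."""
    return 1 if sum(weights.get(f, 0.0) for f in features) > 0 else 0
```

Averaging per-partition models trades some accuracy for parallelism; sending all records through a single sequential pass would be the other obvious option.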

Languages

Jupyter Notebook 99.9%, Python 0.1%