an100 / Spark

Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References

Tips and Tricks

This repo contains a collection of Spark code, written mostly in Python (using the PySpark API). I have also included code and scripts in Scala and SparkR. Feel free to copy and use as-is. Let me know if you have any questions or feedback regarding any of the code.

Zeppelin Notebook Hub (a viewer for Zeppelin notebooks stored in JSON format): https://www.zeppelinhub.com/viewer/

Spark Tuning & Best Practices Reference: https://github.com/zaratsian/HDP_Tuning_Unofficial
Spark Tuning Tool: https://github.com/zaratsian/Spark/blob/master/spark_tuning_tool.py

Machine Learning Cheatsheets:
    • SAS - ML Algorithms
    • SKLearn - Choosing the right estimator
    • MS Azure - ML Algorithms

References:
    • Apache Spark Quickstart
    • Spark PySpark (Python) API
    • Databricks - Guide
    • Databricks - Developer Resources
    • Spark Tuning Guide
    • Spark Tuning - Garbage Collection
    • Hortonworks - Spark Reference
    • Anaconda Hortonworks Management Packs
    • Apache Spark - Best Practices & Tuning


Languages

Jupyter Notebook 97.1%, Python 2.6%, R 0.4%