YLTsai0609 / pyspark_101

Yu Long's note about spark and pyspark

Pyspark 101

Polish up your data processing skills using PySpark!

Installation

Check here to install Spark 3.0+.

Marathon

This repo contains 50+ example scripts and 100+ minimal PySpark processing examples so far.

The tutorial is based on spark-examples/pyspark-examples.

The notebook is a cheatsheet containing 60+ problems with PySpark solutions.

Pyspark basic

Content ID Date Content Note
001 1/11 hello_world
002 1/12 create_spark_session
003 1/12 accumulator
004 1/13 RDD creation
005 1/13 RDD parallelization Repartition() vs Coalesce()
006 1/18 RDD operations - transformations (from 006 - 0064)
007 2/8 cluster managers
008 2/22 spark UI
009 2/23 RDD shuffle
009 2/23 RDD persist
010 3/9 Broadcasting
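
The Repartition() vs Coalesce() entry above can be pictured with a small plain-Python model. This is only a sketch under stated assumptions: partitions are modelled as Python lists, and the helper names `coalesce` and `repartition` are hypothetical stand-ins, not Spark's implementation. In Spark, coalesce without a shuffle only merges existing partitions, while repartition redistributes every record; the modulo grouping and round-robin placement used here are simplifications.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups (models 'no full shuffle').

    Assumption: partitions are grouped by index modulo n; Spark picks groups
    differently, but it likewise never splits an existing partition here.
    """
    if n >= len(partitions):
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Spread every record round-robin over n new partitions (models a full shuffle)."""
    result = [[] for _ in range(n)]
    records = [r for part in partitions for r in part]
    for i, record in enumerate(records):
        result[i % n].append(record)
    return result


# One large partition and three tiny ones.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print([len(p) for p in coalesce(skewed, 2)])     # sizes stay uneven: [7, 2]
print([len(p) for p in repartition(skewed, 2)])  # sizes are balanced: [5, 4]
```

In this model, coalescing moves whole partitions only and leaves the skew in place, while repartitioning touches every record and evens out the sizes.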

Pyspark DataFrame

Content ID Date Content Note
d001 1/18 create_dataframe (from d001 - d0012)
d0011 1/18 create_dataframe_csv
d0012 1/18 create_dataframe_json
d002 1/18 create_empty_dataframe
d003 1/18 spark_frame_to_pandas_frame
d004 1/20 structType/structField from d004 - d0042
d005 1/20 Row object d005
d006 1/20 select column from dataframe
d007 1/26 retreve_data_from_dataframe
d008 1/26 add, update, drop column in a dataframe
d009 1/27 filter rows
d010 1/27 filter null
d011 1/27 drop_na
d012 1/27 drop_duplicated
d013 1/27 sorting
d014 2/8 groupby, pivot from d014 to d0141
d015 2/8 join
d016 2/8 union
d017 2/9 udf
d018 2/9 flatmap
d019 2/9 map
d020 2/13 sampling
d021 2/13 aggregation
d022 2/13 add_month
d023 2/13 split
d024 2/23 regular expression on pyspark dataframe
d025 3/1 extract img src tag in html by pyspark

PySpark data processing package

Content ID Date Content Note
p001 2/13 spark-df-profiling setup doc on pkg/p001
p002 5/20 graphframes

Concept

Content ID Date Content Note
001 1/21 MapReduce
002 1/26 Introduction to Spark(I) - rdd ops, shuffle and stage revisited 4/13
003 2/14 Apache Parquet 2.0
004 2/16 Introduction to Parquet
005 4/13 Introduction to Spark(II) - Driver, Executor, Application, ...
006 4/27 spark join I
007 4/27 spark join II
008 detect data skew in spark UI
009 7/21 Spark OOM

Terminology

  • rdd
  • repartition/coalesce
  • map-reduce
  • yarn
  • mesos
  • parquet

Optimizing Technique

Additional

Graph Algorithm on Spark

Content ID Date Content Note
001 5/20 why graph? why spark?

Reference

kenttw/spark_tutorial

spark-examples/pyspark-examples

spark python api documentation 3.0.1

pandas 101 from yulong's note

Apache Parquet 2.0

Learning Apache Spark with Python

pyspark cheatsheet

2017 - Optimizing Apache Spark SQL Joins: Spark Summit East talk by Vida Ha

2019 - Optimizing Apache Spark SQL at LinkedIn
