Tutorial
Introduction
pyspark examples
- fast real-time processing framework
- use HDFS (Hadoop Distributed File system)
- can run on YARN
- in-memory computations
- perform:
- stream processing in real-time
- batch processing
- interactive queries & iterative algorithms
- written in Scala & supports Python via PySpark
- SparkConf provides configurations to run a Spark application
class pyspark.SparkConf (
loadDefaults = True,
_jvm = None,
_jconf = None
)
- setter methods:
e.g., conf.setAppName("PySpark App").setMaster("local")
- set(key, value) # To set a configuration property
- setMaster(value) # To set the master URL
- setAppName(value) # To set an application name
- get(key, defaultValue=None) # To get a configuration value of a key
- setSparkHome(value) # To set Spark installation path on worker nodes
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)
- when we run any Spark application, a driver program starts; it has the main function, and SparkContext gets initiated there
- SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext
- the PySpark shell has SparkContext available by default as 'sc'
class pyspark.SparkContext (
master = None, # the url of the cluster it connects to
appName = None, # name of the job
sparkHome = None, # spark installation directory
pyFiles = None, # the .zip or .py files to send to the cluster and add to the PYTHONPATH
environment = None, # worker nodes environment variables
batchSize = 0, # The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size
serializer = PickleSerializer(), # RDD serializer
conf = None, # an object of L{SparkConf} to set all the Spark properties
gateway = None, # use an existing gateway and JVM, otherwise initializing a new JVM
jsc = None, # the JavaSparkContext instance
profiler_cls = <class 'pyspark.profiler.BasicProfiler'> # a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler)
)
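A minimal sketch of creating a SparkContext and running a simple count; the local master and app name are assumptions for illustration (in the PySpark shell, 'sc' already exists and should not be created again):

from pyspark import SparkContext

sc = SparkContext("local", "First App")  # assumed local master / app name
# parallelize a small in-memory list into an RDD and count its elements
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print("Number of elements in RDD -> %i" % words.count())
sc.stop()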
- Resilient Distributed Dataset
- RDD: Elements that run and operate on multiple nodes to do parallel processing on a cluster
- to apply any operation in PySpark, need to create a PySpark RDD first
2 ways to apply operations on an RDD (see the sketch after this list):
- Transformation: applied on an RDD to create a new RDD
  e.g., filter, groupBy and map
- Action: applied on an RDD; instructs Spark to perform the computation and send the result back to the driver
count.py # number of elements in the RDD is returned
collect.py # all the elements in the RDD are returned
foreach.py # applies a function to each element in the RDD; returns nothing (used for side effects such as printing)
filter.py # a new RDD is returned containing the elements, which satisfies the function inside the filter
map.py # a new RDD is returned: applying a function to each element in the RDD
reduce.py # After performing the specified commutative and associative binary operation, the final reduced element is returned
join.py # returns RDD with a pair of elements with the matching keys and all the values for that particular key
cache.py # Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the RDD is cached or not.
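A minimal sketch contrasting a transformation (filter) with an action (collect), assuming a local SparkContext; the word list is illustrative:

from pyspark import SparkContext

sc = SparkContext("local", "Filter app")
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka", "spark vs hadoop"])
# transformation: lazily builds a new RDD, nothing is computed yet
words_filter = words.filter(lambda x: "spark" in x)
# action: triggers the computation and returns the result to the driver
print("Filtered RDD -> %s" % words_filter.collect())
sc.stop()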
- Spark uses shared variables for parallel processing
2 types of shared variables: Broadcast & Accumulator
- Broadcast: saves a copy of the data across all nodes
- cached on all the machines and not sent to machines with tasks
class pyspark.Broadcast (
sc = None,
value = None,
pickle_registry = None,
path = None
)
broadcast.py # how to use a Broadcast variable
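A hedged sketch of what broadcast.py might contain (the word list is an assumption):

from pyspark import SparkContext

sc = SparkContext("local", "Broadcast app")
# ship a read-only copy of the list to every node once
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
# .value returns the broadcast data
print("Stored data -> %s" % words_new.value)
print("Printing a particular element -> %s" % words_new.value[2])
sc.stop()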
- Accumulator: aggregates information through associative and commutative operations
- e.g., use an accumulator for a sum operation or counters (as in MapReduce)
class pyspark.Accumulator(aid, value, accum_param)
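A minimal sketch of an Accumulator used for a sum (the input numbers are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(0)  # driver-side accumulator starting at 0

def add_to_accum(x):
    global num
    num += x  # updates made on the workers flow back to the driver

sc.parallelize([20, 30, 20, 30]).foreach(add_to_accum)
print("Accumulated value is -> %i" % num.value)
sc.stop()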
- SparkFiles: resolve the paths to files added through SparkContext.addFile()
sc.addFile(path_to_file) # upload files (sc = SparkContext)
SparkFiles.get(file_name) # get the path of a file on a worker
- class methods:
get(filename) # specifies the path of the file that is added through SparkContext.addFile()
getRootDirectory() # specifies the path to the root directory, which contains the file that is added through SparkContext.addFile()
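A minimal sketch of resolving a file path with SparkFiles; the file name and path are assumptions for illustration:

from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/hadoop/examples_pyspark/finddistance.R"  # assumed local file
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
# resolve the local path of the added file on this node
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
sc.stop()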
- StorageLevel decides how an RDD should be stored:
- memory
- disk
- both
- whether to serialize RDD and whether to replicate RDD partitions
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
- storage level:
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(True, True, True, False, 1)
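A minimal sketch of persisting an RDD with an explicit storage level (MEMORY_AND_DISK_2 chosen just for illustration):

import pyspark
from pyspark import SparkContext

sc = SparkContext("local", "StorageLevel app")
rdd1 = sc.parallelize([1, 2])
# keep partitions in memory and on disk, replicated on 2 nodes
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
sc.stop()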
- MLlib
mllib.classification - supports various methods for binary classification, multiclass classification and regression analysis. e.g., Random Forest, Naive Bayes, Decision Tree, etc.
mllib.clustering - unsupervised learning problem: group subsets of entities with one another based on some notion of similarity.
mllib.fpm - Frequent pattern mining: mining frequent items, itemsets, subsequences or other substructures, usually among the first steps in analyzing a large-scale dataset.
mllib.linalg - MLlib utilities for linear algebra.
mllib.recommendation - Collaborative filtering: fill in the missing entries of a user-item association matrix. spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries; spark.mllib uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.
mllib.regression - Regression algorithms: find relationships and dependencies between variables. The interface for working with linear regression models and model summaries is similar to the logistic regression case.
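A hedged sketch of mllib.recommendation with ALS; the (user, product, rating) triples are made-up illustration data:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "Recommendation app")
ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 2, 1.0),
    Rating(3, 1, 1.0), Rating(3, 2, 5.0),
])
# learn the latent factors with Alternating Least Squares
model = ALS.train(ratings, rank=10, iterations=10)
# predict how user 3 would rate product 1
print(model.predict(3, 1))
sc.stop()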
- serialization is used for performance tuning (plays an important role in costly operations)
- all data that is sent over the network, written to disk, or persisted in memory should be serialized
2 serializers:
- Python’s Marshal Serializer
- faster than PickleSerializer
- supports fewer data types
class pyspark.MarshalSerializer
- Python’s Pickle Serializer
- supports nearly any Python object
- may not be as fast as more specialized serializers
class pyspark.PickleSerializer
serializing.py # serialize the data using MarshalSerializer
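A hedged sketch of what serializing.py might contain, switching the RDD serializer to MarshalSerializer:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# faster than PickleSerializer, but supports fewer data types
sc = SparkContext("local", "Serialization app", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()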