Tutorial
Introduction
pyspark examples
- fast real-time processing framework
- use HDFS (Hadoop Distributed File system)
- can run on YARN
- in-memory computations
- perform:
- stream processing in real-time
- batch processing
- interactive queries & iterative algorithms
- written in Scala & supports Python via PySpark
- SparkConf provides configurations to run a Spark application
class pyspark.SparkConf (
loadDefaults = True,
_jvm = None,
_jconf = None
)
- setter methods:
e.g., conf.setAppName("PySpark App").setMaster("local")
- set(key, value) # To set a configuration property
- setMaster(value) # To set the master URL
- setAppName(value) # To set an application name
- get(key, defaultValue=None) # To get a configuration value of a key
- setSparkHome(value) # To set Spark installation path on worker nodes
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)
- when we run any Spark application, a driver program starts; it has the main function, and SparkContext gets initiated there
- SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext
- the PySpark shell has SparkContext available by default as 'sc'
class pyspark.SparkContext (
master = None, # the url of the cluster it connects to
appName = None, # name of the job
sparkHome = None, # spark installation directory
pyFiles = None, # the .zip or .py files to send to the cluster and add to the PYTHONPATH
environment = None, # worker nodes environment variables
batchSize = 0, # The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size
serializer = PickleSerializer(), # RDD serializer
conf = None, # an object of L{SparkConf} to set all the Spark properties
gateway = None, # use an existing gateway and JVM, otherwise initializing a new JVM
jsc = None, # the JavaSparkContext instance
profiler_cls = <class 'pyspark.profiler.BasicProfiler'> # a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler)
)
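A minimal sketch of creating a SparkContext and running a simple count; the local master and app name are assumptions for illustration (in the PySpark shell, 'sc' already exists and should not be created again):

from pyspark import SparkContext

sc = SparkContext("local", "First App")  # assumed local master / app name
# parallelize a small in-memory list into an RDD and count its elements
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka"])
print("Number of elements in RDD -> %i" % words.count())
sc.stop()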
- Resilient Distributed Dataset
- RDD: Elements that run and operate on multiple nodes to do parallel processing on a cluster
- to apply any operation in PySpark, need to create a PySpark RDD first
2 ways to apply operations on an RDD (see the sketch after this list):
- Transformation: applied on an RDD to create a new RDD
  e.g., filter, groupBy and map
- Action: applied on an RDD; instructs Spark to perform the computation and send the result back to the driver
count.py # number of elements in the RDD is returned
collect.py # all the elements in the RDD are returned
foreach.py # applies a function to each element in the RDD; returns nothing (used for side effects such as printing)
filter.py # a new RDD is returned containing the elements, which satisfies the function inside the filter
map.py # a new RDD is returned: applying a function to each element in the RDD
reduce.py # After performing the specified commutative and associative binary operation, the final reduced element is returned
join.py # returns RDD with a pair of elements with the matching keys and all the values for that particular key
cache.py # Persist this RDD with the default storage level (MEMORY_ONLY). You can also check if the RDD is cached or not.
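A minimal sketch contrasting a transformation (filter) with an action (collect), assuming a local SparkContext; the word list is illustrative:

from pyspark import SparkContext

sc = SparkContext("local", "Filter app")
words = sc.parallelize(["scala", "java", "hadoop", "spark", "akka", "spark vs hadoop"])
# transformation: lazily builds a new RDD, nothing is computed yet
words_filter = words.filter(lambda x: "spark" in x)
# action: triggers the computation and returns the result to the driver
print("Filtered RDD -> %s" % words_filter.collect())
sc.stop()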
- Spark uses shared variables for parallel processing
2 types of shared variables: Broadcast & Accumulator
- Broadcast: saves a copy of the data across all nodes
- cached on all the machines and not sent to machines with tasks
class pyspark.Broadcast (
sc = None,
value = None,
pickle_registry = None,
path = None
)
broadcast.py # how to use a Broadcast variable
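A hedged sketch of what broadcast.py might contain (the word list is an assumption):

from pyspark import SparkContext

sc = SparkContext("local", "Broadcast app")
# ship a read-only copy of the list to every node once
words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
# .value returns the broadcast data
print("Stored data -> %s" % words_new.value)
print("Printing a particular element -> %s" % words_new.value[2])
sc.stop()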
- Accumulator: aggregates information through associative and commutative operations
- e.g., use an accumulator for a sum operation or counters (as in MapReduce)
class pyspark.Accumulator(aid, value, accum_param)
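A minimal sketch of an Accumulator used for a sum (the input numbers are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator app")
num = sc.accumulator(0)  # driver-side accumulator starting at 0

def add_to_accum(x):
    global num
    num += x  # updates made on the workers flow back to the driver

sc.parallelize([20, 30, 20, 30]).foreach(add_to_accum)
print("Accumulated value is -> %i" % num.value)
sc.stop()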
- SparkFiles: resolve the paths to files added through SparkContext.addFile()
sc.addFile(path_to_file) # upload files (sc = SparkContext)
SparkFiles.get(file_name) # get the path of a file on a worker
- class methods:
get(filename) # specifies the path of the file that is added through SparkContext.addFile()
getRootDirectory() # specifies the path to the root directory, which contains the file that is added through SparkContext.addFile()
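A minimal sketch of resolving a file path with SparkFiles; the file name and path are assumptions for illustration:

from pyspark import SparkContext
from pyspark import SparkFiles

finddistance = "/home/hadoop/examples_pyspark/finddistance.R"  # assumed local file
finddistancename = "finddistance.R"

sc = SparkContext("local", "SparkFile App")
sc.addFile(finddistance)
# resolve the local path of the added file on this node
print("Absolute Path -> %s" % SparkFiles.get(finddistancename))
sc.stop()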
- StorageLevel decides how an RDD should be stored:
- memory
- disk
- both
- whether to serialize RDD and whether to replicate RDD partitions
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
- storage level:
DISK_ONLY = StorageLevel(True, False, False, False, 1)
DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
OFF_HEAP = StorageLevel(True, True, True, False, 1)
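A minimal sketch of persisting an RDD with an explicit storage level (MEMORY_AND_DISK_2 chosen just for illustration):

import pyspark
from pyspark import SparkContext

sc = SparkContext("local", "StorageLevel app")
rdd1 = sc.parallelize([1, 2])
# keep partitions in memory and on disk, replicated on 2 nodes
rdd1.persist(pyspark.StorageLevel.MEMORY_AND_DISK_2)
print(rdd1.getStorageLevel())
sc.stop()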
- MLlib
mllib.classification - supports various methods for binary classification, multiclass classification and regression analysis. e.g., Random Forest, Naive Bayes, Decision Tree, etc.
mllib.clustering - unsupervised learning problem: group subsets of entities with one another based on some notion of similarity.
mllib.fpm - Frequent pattern mining: mining frequent items, itemsets, subsequences or other substructures, usually among the first steps in analyzing a large-scale dataset.
mllib.linalg - MLlib utilities for linear algebra.
mllib.recommendation - Collaborative filtering: fill in the missing entries of a user-item association matrix. spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries; spark.mllib uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.
mllib.regression - Regression algorithms: find relationships and dependencies between variables. The interface for working with linear regression models and model summaries is similar to the logistic regression case.
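A hedged sketch of mllib.recommendation with ALS; the (user, product, rating) triples are made-up illustration data:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "Recommendation app")
ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 2, 1.0),
    Rating(3, 1, 1.0), Rating(3, 2, 5.0),
])
# learn the latent factors with Alternating Least Squares
model = ALS.train(ratings, rank=10, iterations=10)
# predict how user 3 would rate product 1
print(model.predict(3, 1))
sc.stop()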
- serialization is used for performance tuning (plays an important role in costly operations)
- all data that is sent over the network, written to disk, or persisted in memory should be serialized
2 serializers:
- Python’s Marshal Serializer
- faster than PickleSerializer
- supports fewer data types
class pyspark.MarshalSerializer
- Python’s Pickle Serializer
- supports nearly any Python object
- may not be as fast as more specialized serializers
class pyspark.PickleSerializer
serializing.py # serialize the data using MarshalSerializer
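A hedged sketch of what serializing.py might contain, switching the RDD serializer to MarshalSerializer:

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# faster than PickleSerializer, but supports fewer data types
sc = SparkContext("local", "Serialization app", serializer=MarshalSerializer())
print(sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10))
sc.stop()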