rarezhang / pyspark_training


PySpark

Tutorial
Introduction
pyspark examples

Introduction

  • fast, general-purpose processing framework with real-time support
    • can use HDFS (Hadoop Distributed File System)
    • can run on YARN
  • performs in-memory computations
  • can perform:
    • stream processing in real-time
    • batch processing
    • interactive queries & iterative algorithms
  • written in Scala; supports Python through PySpark

Configurations and parameters

  • SparkConf provides the configuration for running a Spark application
class pyspark.SparkConf (
   loadDefaults = True, 
   _jvm = None, 
   _jconf = None
)
  • commonly used methods (setters can be chained): e.g., conf.setAppName("PySpark App").setMaster("local")
    • set(key, value) # To set a configuration property
    • setMaster(value) # To set the master URL
    • setAppName(value) # To set an application name
    • get(key, defaultValue=None) # To get a configuration value of a key
    • setSparkHome(value) # To set Spark installation path on worker nodes
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("PySpark App").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)

SparkContext

  • when we run any Spark application, a driver program starts; it runs the main function, and the SparkContext gets initiated there
  • SparkContext uses Py4J to launch a JVM and create a JavaSparkContext
  • the PySpark shell provides SparkContext as 'sc' by default
class pyspark.SparkContext (
   master = None,  # the URL of the cluster it connects to  
   appName = None,  # name of the job  
   sparkHome = None,  # spark installation directory  
   pyFiles = None,  # the .zip or .py files to send to the cluster and add to the PYTHONPATH  
   environment = None,  # worker nodes environment variables  
   batchSize = 0,  # The number of Python objects represented as a single Java object. Set 1 to disable batching, 0 to automatically choose the batch size based on object sizes, or -1 to use an unlimited batch size  
   serializer = PickleSerializer(),  # RDD serializer  
   conf = None,  # an object of L{SparkConf} to set all the Spark properties
   gateway = None,  # use an existing gateway and JVM, otherwise initializing a new JVM  
   jsc = None,  # the JavaSparkContext instance
   profiler_cls = <class 'pyspark.profiler.BasicProfiler'>  # a class of custom Profiler used to do profiling (the default is pyspark.profiler.BasicProfiler)  
)

spark_context.py
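
A minimal sketch of what such a script might look like (the app name and sample data below are illustrative, not taken from spark_context.py):

from pyspark import SparkContext

sc = SparkContext("local", "First App")   # master URL and application name
nums = sc.parallelize([1, 2, 3, 4])       # distribute a local collection as an RDD
print(nums.count())                       # 4
sc.stop()                                 # release the context when done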

RDD

  • Resilient Distributed Dataset
  • RDD: a distributed collection of elements partitioned across the nodes of a cluster and operated on in parallel
  • to apply any operation in PySpark, an RDD must be created first

Two kinds of operations can be applied on an RDD:

Transformation

operations applied on an RDD to create a new RDD
e.g., filter, groupBy and map

Action

operations applied on an RDD
instruct Spark to perform the computation and send the result back to the driver (see the sketch after the file list below)

count.py # the number of elements in the RDD is returned
collect.py # all the elements in the RDD are returned
foreach # applies the given function to each element of the RDD; nothing is returned to the driver
filter.py # a new RDD is returned containing the elements that satisfy the function inside filter
map.py # a new RDD is returned by applying a function to each element in the RDD
reduce.py # performs the specified commutative and associative binary operation and returns the result
join.py # returns an RDD of pairs with matching keys, together with all the values for each such key
cache.py # persists the RDD with the default storage level (MEMORY_ONLY); you can also check whether the RDD is cached
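
A minimal sketch combining a few of these operations (the sample words and numbers are made up for illustration):

from pyspark import SparkContext

sc = SparkContext("local", "RDD ops")
words = sc.parallelize(["scala", "java", "hadoop", "spark"])
print(words.count())                                   # action: 4
print(words.filter(lambda w: "s" in w).collect())      # transformation + action: ['scala', 'spark']
print(words.map(lambda w: (w, 1)).collect())           # pair each word with 1
nums = sc.parallelize([1, 2, 3, 4, 5])
print(nums.reduce(lambda a, b: a + b))                 # 15
sc.stop()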

Parallel processing

  • Spark uses shared variables for parallel processing

Two types of shared variables:

Broadcast

  • saves a copy of the data on every node
  • the broadcast value is cached on all machines rather than shipped to machines with tasks
class pyspark.Broadcast (
   sc = None, 
   value = None, 
   pickle_registry = None, 
   path = None
)

broadcast.py # how to use a Broadcast variable
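
A minimal sketch of a Broadcast variable along these lines (the word list is illustrative; the repo's own example is broadcast.py):

from pyspark import SparkContext

sc = SparkContext("local", "Broadcast sketch")
words_new = sc.broadcast(["scala", "java", "hadoop", "spark"])  # read-only value cached on every node
print("Stored data -> %s" % (words_new.value,))                 # access the broadcast value
print("Element at index 2 -> %s" % words_new.value[2])          # 'hadoop'
sc.stop()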

Accumulator

  • aggregates information through associative and commutative operations
  • e.g., use an accumulator for a sum operation or for counters (as in MapReduce)
class pyspark.Accumulator(aid, value, accum_param)

accumulator.py
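
A minimal sketch of an Accumulator used as a sum (the numbers are illustrative; the repo's own example is accumulator.py):

from pyspark import SparkContext

sc = SparkContext("local", "Accumulator sketch")
num = sc.accumulator(0)                  # shared counter, initialized to 0

def add_to_accumulator(x):
    global num
    num += x                             # tasks may only add; the driver reads the value

sc.parallelize([1, 2, 3, 4]).foreach(add_to_accumulator)
print("Accumulated value: %i" % num.value)   # 10
sc.stop()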

Files

  • sc.addFile(path_to_file) : upload a file to every worker node (sc = SparkContext)
  • SparkFiles.get(file_name) : get the path of that file on a worker
  • SparkFiles resolves the paths to files added through SparkContext.addFile(); class methods:
    • get(filename) # returns the path of the file added through SparkContext.addFile()
    • getRootDirectory() # returns the path to the root directory that contains files added through SparkContext.addFile()

sparkfile.py
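
A minimal sketch of the SparkFiles workflow (the file path below is hypothetical):

from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "SparkFiles sketch")
sc.addFile("/path/to/some_file.txt")        # hypothetical path; ships the file to every node
print(SparkFiles.get("some_file.txt"))      # absolute path of that file on this node
print(SparkFiles.getRootDirectory())        # directory holding all files added via addFile
sc.stop()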

Storage

  • StorageLevel decides how an RDD should be stored:
    • in memory
    • on disk
    • or both
  • it also decides whether to serialize the RDD and whether to replicate the RDD partitions
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
  • storage level:
    DISK_ONLY = StorageLevel(True, False, False, False, 1)
    DISK_ONLY_2 = StorageLevel(True, False, False, False, 2)
    MEMORY_AND_DISK = StorageLevel(True, True, False, False, 1)
    MEMORY_AND_DISK_2 = StorageLevel(True, True, False, False, 2)
    MEMORY_AND_DISK_SER = StorageLevel(True, True, False, False, 1)
    MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, False, 2)
    MEMORY_ONLY = StorageLevel(False, True, False, False, 1)
    MEMORY_ONLY_2 = StorageLevel(False, True, False, False, 2)
    MEMORY_ONLY_SER = StorageLevel(False, True, False, False, 1)
    MEMORY_ONLY_SER_2 = StorageLevel(False, True, False, False, 2)
    OFF_HEAP = StorageLevel(True, True, True, False, 1)

storagelevel.py
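
A minimal sketch of persisting an RDD with an explicit storage level (the data is illustrative):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "StorageLevel sketch")
rdd = sc.parallelize([1, 2])
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)   # memory first, spill to disk, 2 replicas
print(rdd.getStorageLevel())                  # prints the level actually set
sc.stop()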

Machine Learning API

  • MLlib
    mllib.classification - supports various methods for binary classification, multiclass classification and regression analysis, e.g., Random Forest, Naive Bayes, Decision Tree, etc.
    mllib.clustering - unsupervised learning: group subsets of entities with one another based on some notion of similarity.
    mllib.fpm - frequent pattern mining: mining frequent items, itemsets, subsequences or other substructures, usually among the first steps in analyzing a large-scale dataset.
    mllib.linalg - MLlib utilities for linear algebra.
    mllib.recommendation - collaborative filtering: fill in the missing entries of a user-item association matrix. spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries; it uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.
    mllib.regression - regression algorithms: find relationships and dependencies between variables. The interface for working with linear regression models and model summaries is similar to the logistic regression case.

recommend.py
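
A minimal sketch of model-based collaborative filtering with mllib.recommendation (the tiny rating set is made up; the repo's own example is recommend.py):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "ALS sketch")
ratings = sc.parallelize([                 # (user, product, rating) triples
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 3, 5.0),
])
model = ALS.train(ratings, rank=10, iterations=5)   # learn latent user/product factors
print(model.predict(1, 3))                          # predicted rating for a missing entry
sc.stop()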

Serializers

  • serializers are used for performance tuning; serialization plays an important role in costly operations
  • all data that is sent over the network, written to disk, or persisted in memory must be serialized

Two serializers:

MarshalSerializer

  • Python’s Marshal Serializer
  • faster than PickleSerializer
  • supports fewer data types
class pyspark.MarshalSerializer  

PickleSerializer

  • Python’s Pickle Serializer
  • supports nearly any Python object
  • may not be as fast as more specialized serializers
class pyspark.PickleSerializer

serializing.py # serialize the data using MarshalSerializer
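
A minimal sketch of choosing MarshalSerializer when the context is created (the data is illustrative):

from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# the serializer has to be chosen when the SparkContext is constructed
sc = SparkContext("local", "Serialization sketch", serializer=MarshalSerializer())
print(sc.parallelize(list(range(10))).map(lambda x: x * 2).take(5))   # [0, 2, 4, 6, 8]
sc.stop()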
