last updated: Feb 13, 2021
pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD and DataFrame transformation functions that run on slave nodes.
The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For local debugging, pyspark_xray specifically provides the capability of debugging Spark application code that runs on slave nodes; the absence of this capability is an unfilled gap for Spark application developers right now.
For developers, it’s very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:

- code that runs on the master node
- code that runs on worker/slave nodes
While code on the master node can be reached by a debugger locally, code on slave nodes is like a blackbox, not accessible locally by a debugger.
Plenty of tutorials on the web cover the steps of debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people refer to this part of the code either as a blackbox or as something that doesn't need debugging.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities by using the following techniques:
- wrapper functions of Spark code on slave nodes; check out the section below for details
- the practice of sampling input data under local debugging mode so that the application fits into the memory of your standalone local PC/Mac
  - For example, if your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory, you may take 100 sample rows as the input to debug your application locally using pyspark_xray
- usage of a flag to auto-detect local mode: `CONST_BOOL_LOCAL_MODE` from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
  - True: if the current OS is Mac or Windows
  - False: otherwise

  With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base; a minimal sketch of such a flag follows below.
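A minimal sketch, assuming OS detection via Python's standard platform module (the library's actual const.py logic may differ):

```python
import platform

# True on a developer's Mac ("Darwin") or Windows machine, where local debugging
# happens; False otherwise, e.g. on the (typically Linux) nodes of a real cluster.
CONST_BOOL_LOCAL_MODE = platform.system() in ("Darwin", "Windows")
```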
pyspark_xray introduces the following wrapper functions (defined in utils_m.py) around Spark native functions to enable local debugging of code that runs on slave nodes:
`RDD.map` returns a new RDD by applying a function to each element of the RDD. `wrapper_rdd_map` wraps `RDD.map` so that it runs the original function when `CONST_BOOL_LOCAL_MODE` is False and a locally debuggable re-written version when `CONST_BOOL_LOCAL_MODE` is True.
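The re-written versions actually shipped by the library live in its own source; the following is only a hypothetical sketch of the idea, reusing the keyword arguments that the demo code shown later passes to the wrapper. In local mode the rows are collected to the driver and the function is applied in a plain Python loop, which is why an IDE breakpoint inside the function is hit.

```python
def wrapper_rdd_map(input_rdd, func, spark_session, debug_flag):
    """Sketch only: make func per-row debuggable on the driver in local mode."""
    if debug_flag:
        # Collect to the driver and loop in plain Python; breakpoints inside
        # func now stop, because nothing runs in a worker process.
        mapped = [func(row) for row in input_rdd.collect()]
        return spark_session.sparkContext.parallelize(mapped)
    # Normal path: distributed execution on worker/slave nodes.
    return input_rdd.map(func)
```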
`RDD.mapValues` passes each value in a key-value pair RDD through a map function without changing the keys; it also retains the original RDD's partitioning. `wrapper_rdd_mapvalues` wraps `RDD.mapValues` so that it runs the original function when `CONST_BOOL_LOCAL_MODE` is False and a locally debuggable re-written version when `CONST_BOOL_LOCAL_MODE` is True. Like `RDD.mapValues` itself, the `wrapper_rdd_mapvalues` wrapper is used on key-value pair RDDs.
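As a quick, plain-PySpark illustration of the mapValues semantics the wrapper preserves (the key-value pairs here are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("mapvalues-demo").getOrCreate()
pairs = spark.sparkContext.parallelize([(300, 15000.0), (301, 27000.0)])

# Only the values are mapped; the loan_id keys pass through unchanged.
print(pairs.mapValues(lambda amt: amt * 2).collect())
# [(300, 30000.0), (301, 54000.0)]
```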
`DataFrame.mapInPandas` maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame. `wrapper_sdf_mapinpandas` wraps Spark's `DataFrame.mapInPandas` so that it runs the original function when `CONST_BOOL_LOCAL_MODE` is False and a locally debuggable re-written version when `CONST_BOOL_LOCAL_MODE` is True.
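For reference, a minimal plain-PySpark mapInPandas usage (Spark 3.0+, requires pandas and pyarrow installed); the column names and fee logic are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("mapinpandas-demo").getOrCreate()
df = spark.createDataFrame([(300, 15000.0), (301, 27000.0)], ["loan_id", "loan_amt"])

def add_fee(batches):
    # batches is an iterator of pandas DataFrames; yield transformed batches
    for pdf in batches:
        pdf["loan_amt"] = pdf["loan_amt"] + 100.0
        yield pdf

# The output schema is unchanged here, so the input schema can be reused.
df.mapInPandas(add_fee, schema=df.schema).show()
```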
Check out demo_app02 for a demo of this wrapper; refer to the demo apps section below to learn how to set it up.
- demo_app demonstrates a sample use case of `wrapper_rdd_map` and `wrapper_rdd_mapvalues`
- demo_app02 demonstrates a sample use case of `wrapper_sdf_mapinpandas`
The demo_app folder contains a simple PySpark application that ingests data for 11 loans into a Spark RDD, then passes the `calc_mthly_payment` and `calc_interest` lambda functions to two variants of the RDD `mapValues` transformation (native and pyspark_xray-wrapped) to calculate the monthly payment amount and the total interest amount for each loan. See the input and output below; a hedged sketch of the two calculations follows the output.
```
ingested 11 loans
input =
+-------+--------+-----+----------+
|loan_id|loan_amt|  apr|term_years|
+-------+--------+-----+----------+
|    300| 15000.0|0.054|         6|
|    301| 27000.0|0.034|         6|
|    302| 33000.0|0.053|         5|
|    303| 45000.0|0.035|         5|
|    304| 56000.0|0.033|         7|
|    305| 44000.0|0.032|         4|
|    306| 25000.0|0.043|         5|
|    307| 26000.0|0.023|         7|
|    308| 35200.0|0.034|         6|
|    309| 57000.0|0.055|         5|
|    310| 45300.0|0.034|         5|
+-------+--------+-----+----------+
payment output =
+-------+---------------+
|loan_id|monthly_pmt_amt|
+-------+---------------+
|    300|         244.37|
|    301|         415.08|
|    302|          627.3|
|    303|         818.63|
|    304|         747.54|
|    305|          977.8|
|    306|         463.81|
|    307|          335.4|
|    308|         541.14|
|    309|        1088.77|
|    310|         822.06|
+-------+---------------+
interest output =
+-----------------+-------+
|     interest_amt|loan_id|
+-----------------+-------+
|           4860.0|    300|
|5508.000000000001|    301|
|           8745.0|    302|
|7875.000000000001|    303|
|          12936.0|    304|
|           5632.0|    305|
|           5375.0|    306|
|           4186.0|    307|
|7180.800000000001|    308|
|          15675.0|    309|
|           7701.0|    310|
+-----------------+-------+
```
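The authoritative lambda implementations live in demo_app's utils_s.py; the hedged reconstruction below, which assumes a simplified tuple row format purely for illustration, reproduces the figures above with a standard amortized monthly payment and simple interest over the full term:

```python
def calc_mthly_payment(row):
    # Assumed row layout for this sketch: (loan_amt, apr, term_years)
    loan_amt, apr, term_years = row
    r = apr / 12               # monthly interest rate
    n = term_years * 12        # number of monthly payments
    return round(loan_amt * r / (1 - (1 + r) ** -n), 2)

def calc_interest(row):
    loan_amt, apr, term_years = row
    return loan_amt * apr * term_years  # simple interest over the full term

print(calc_mthly_payment((15000.0, 0.054, 6)))  # 244.37, matches loan_id 300
print(calc_interest((15000.0, 0.054, 6)))       # 4860.0, matches loan_id 300
```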
- const.py: defines a couple of variables shared by multiple modules
- driver.py: the main entry point of the Spark application; creates SparkConf and SparkSession objects and kicks off the Spark application (a hypothetical skeleton follows this list)
- main.py: the backbone of the Spark application; runs on the master node, reads data into an RDD, performs RDD transformations, then prints output
- utils_m.py: utility functions that run on the master node, mainly for data ingestion
- utils_s.py: lambda functions that are passed as parameters to RDD transformation functions and run on slave nodes
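As a rough, hypothetical skeleton of the driver.py role described above (names and config values here are illustrative, not the demo's exact code):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

def run():
    # Create SparkConf and SparkSession, then hand off to the application logic
    conf = SparkConf().setAppName("demo_app").setMaster("local[*]")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # main.py's logic would take over from here: ingest loans, transform, print
    spark.stop()

if __name__ == "__main__":
    run()
```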
As of February 2021:
- pyspark_xray (this package)
- spark: v3.0.1
- pyspark: v3.0.1
- java: v1.8.0
- PyCharm: Community v2020.3
- Open a command line and run the `java` command; if you get an error, download and install Java (version 1.8.0_221 as of April 2020)
- If you don't have it, download and install PyCharm Community edition (version 2020.1 as of April 2020)
- If you don't have it, download and install the Anaconda Python 3.7 runtime
- Download and install the latest Spark release Pre-built for Apache Hadoop (spark-2.4.5-bin-hadoop2.7 as of April 2020, 200+ MB in size) locally
  - Windows:
    - if you don't have an unzip tool, download and install 7zip, a free tool to zip/unzip files
    - extract the contents of the Spark tgz file to the c:\spark-x.x.x-bin-hadoopx.x folder
    - follow the steps in this tutorial
      - install `winutils.exe` into the c:\spark-x.x.x-bin-hadoopx.x\bin folder; without this executable, you will run into errors when writing engine output
  - Mac:
    - extract the contents of the Spark tgz file to the /Users/[USERNAME]/spark-x.x.x-bin-hadoopx.x folder
- install pyspark with `pip install pyspark` or `conda install pyspark`, making sure the pyspark version matches the version of Spark you use
You run a Spark application on a cluster from the command line by issuing the `spark-submit` command, which submits a Spark job to the cluster. But `spark-submit` cannot be used to kick off a Spark job from PyCharm or another IDE on a local laptop or PC. Instead, follow these steps to set up a Run Configuration for pyspark_xray's demo_app in PyCharm:
- Set environment variables:
  - set `HADOOP_HOME` to C:\spark-x.x.x-bin-hadoop2.7
  - set `SPARK_HOME` to C:\spark-x.x.x-bin-hadoop2.7
- use GitHub Desktop or another git tool to clone `pyspark_xray` from GitHub
- PyCharm > Open pyspark_xray as project
- Open PyCharm > Run > Edit Configurations > Defaults > Python and enter the following values:
  - Environment variables (Windows): `PYTHONUNBUFFERED=1;PYSPARK_PYTHON=python;PYTHONPATH=$SPARK_HOME/python;PYSPARK_SUBMIT_ARGS=pyspark-shell;`
- Open PyCharm > Run > Edit Configurations, create a new Python configuration, and point the script to the path of `driver.py` under pyspark_xray > demo_app (see screenshot below)
In main.py, after the loan data is ingested into an RDD, two types of RDD transformation functions are called one after the other to demonstrate the difference in debugging capability between pyspark_xray's RDD transformation wrappers and the native RDD transformation functions.
First, the native RDD `mapValues` transformation function is called with `calc_mthly_payment` as the lambda function parameter:

```python
rdd_pmt = loan_rdd.mapValues(lambda x: utils_slave.calc_mthly_payment(row=x))
```
Then pyspark_xray's wrapper of the RDD `mapValues` transformation, i.e. the `wrapper_rdd_mapvalues` function, is called with `calc_interest` as the lambda function parameter:

```python
rdd_int = utils_debugger.wrapper_rdd_mapvalues(input_rdd=loan_rdd
    , func=lambda x: utils_slave.calc_interest(row=x)
    , spark_session=self.spark_session
    , debug_flag=const_xray.CONST_BOOL_LOCAL_MODE)
```
Correspondingly, breakpoints are set within the `calc_mthly_payment` and `calc_interest` lambda functions in utils_s.py. NOTE: these are breakpoints that were not stoppable before adopting pyspark_xray.
Now start debugging demo_app: the breakpoint set in the `calc_mthly_payment` function will be skipped, but the breakpoint in the `calc_interest` function will be hit, see below. The reason is that the `calc_interest` lambda function was passed to pyspark_xray's wrapper of the RDD `mapValues` transformation, while the `calc_mthly_payment` function was passed to the original RDD `mapValues` transformation.
PySpark Resources:

- pyspark topic on Github
- pyspark pycharm debugging google search
- local debug scala spark google search
- another pyspark tuning tool: drpyspark